Candidate orphan genes: Reassessing uniqueness.
Sheetalpreet K Maan, Xuan Y Butzin, Steven X Ge, Nicholas C Butzin
Abstract
Orphan genes lack recognizable homologues outside a given taxonomic unit; thus, they have uncertain evolutionary origins. This presents a profound challenge to traditional models of gene evolution. Their presence has fueled ongoing debate, and they have long been implicated in driving lineage-specific traits in medicine and evolutionary biology. These genes are often linked to species-specific traits and pathogenic mechanisms, including virulence and environmental adaptation, and their study provides critical insights into the origins and evolution of novel genes. Intrigued by their enigmatic nature, we re-analyzed a comprehensive 2023 dataset of orphan genes compiled from over 80,000 bacterial species. Using homology-based analyses, we reassessed the taxonomic distribution of each gene across a broader genomic landscape. Many "orphan genes" identified in 2023 now align with homologs in other bacterial taxa (as of 2025), indicating that limited database sampling had previously inflated the number of apparent orphan genes. This reassessment revealed an approximately 81% decrease in the number of orphan genes within just two years. These results challenge the long-held view that bacterial species truly harbor large numbers of orphan genes and instead indicate that their prevalence has been overestimated. To better reflect these findings, we propose that orphan genes be annotated using descriptors such as 'candidate' or 'putative', which more accurately convey the provisional and potentially temporary nature of their apparent uniqueness. Although our analysis greatly reduced false-positive classifications, it cannot determine whether a given candidate truly encodes a functional gene or is an artifact of bioinformatic analysis. To prioritize the most promising targets for biochemical or genetic validation, we applied additional computational filters and identified a subset of candidates most likely to encode bona fide proteins.
This study redefines the current understanding of orphan gene prevalence, establishes that such genes should be annotated with descriptors such as 'candidate' or 'putative', much as we label "candidate bacterial species," and provides a refined, high-confidence dataset for future in vitro and in vivo investigations.
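The reclassification logic described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Hit` record, the function names, the E-value cutoff of 1e-5, and the toy data are all assumptions made for illustration. A gene keeps its candidate orphan label only if no sufficiently strong homology hit is found in a taxon other than its own, and the percent decrease is computed from the counts before and after reassessment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One homology search hit (hypothetical record layout)."""
    gene_id: str    # query gene previously labeled as an orphan
    hit_taxon: str  # taxon of the database sequence that was matched
    evalue: float   # alignment E-value reported by the search tool


def candidate_orphans(gene_taxon, hits, max_evalue=1e-5):
    """Return the genes that still lack a significant hit outside their own taxon.

    gene_taxon maps each gene ID to its source taxon. max_evalue is an
    assumed significance cutoff, not a value taken from the study.
    """
    has_outside_homolog = set()
    for hit in hits:
        own_taxon = gene_taxon.get(hit.gene_id)
        if own_taxon is None:
            continue  # hit for a gene that is not in the orphan list
        if hit.hit_taxon != own_taxon and hit.evalue <= max_evalue:
            has_outside_homolog.add(hit.gene_id)
    return set(gene_taxon) - has_outside_homolog


def percent_decrease(before, after):
    """Percent reduction in the orphan count after reassessment."""
    return 100.0 * (before - after) / before


# Toy data, invented for illustration only.
gene_taxon = {"g1": "A", "g2": "A", "g3": "B", "g4": "B", "g5": "C"}
hits = [
    Hit("g1", "B", 1e-30),  # strong hit in another taxon: no longer an orphan
    Hit("g2", "A", 1e-40),  # hit within its own taxon: does not count
    Hit("g3", "C", 1e-2),   # hit too weak to pass the cutoff
    Hit("g4", "A", 1e-12),  # strong hit in another taxon: no longer an orphan
]

remaining = candidate_orphans(gene_taxon, hits)
decrease = percent_decrease(len(gene_taxon), len(remaining))
print(sorted(remaining), decrease)
```

Because the outcome depends on which sequences the database contained at search time, the surviving set is best read as provisional, which is the rationale for the 'candidate' or 'putative' label proposed above.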