GeneCAD: Plant Genome Annotation with a DNA Foundation Model.
Zong-Yan Liu, Ana Berthel, Eric Czech, Michelle Stitzer, Sheng-Kai Hsu, Matt Pennell, Edward S Buckler, Jingjing Zhai
Abstract
Accurate genome annotation remains a bottleneck in plants, where polyploidy and repeat-rich sequence confound homology- and RNA-based pipelines. We introduce GeneCAD, a sequence-only method that predicts complete plant gene models directly from DNA. GeneCAD couples representations from a plant DNA foundation model, PlantCAD2, with a lightweight ModernBERT encoder and a chromosome-wide conditional random field that enforces splice-phase and feature order, and applies a protein language-model screen to suppress repeat-driven open reading frames. To limit label noise, we rank and filter public annotations using a sequence-based masked-motif score and fine-tune on five phylogenetically diverse, high-quality references. Across five held-out angiosperms, including the allotetraploid Nicotiana tabacum, GeneCAD improves transcript-level F1 by 8-10% on average over Helixer and BRAKER3, increases the number of exact-match transcripts, and sharpens boundaries at start/stop codons and splice junctions. By removing dependence on species-matched RNA-seq or proteomics while retaining cross-species accuracy, GeneCAD provides an accurate, scalable route to biologically coherent plant gene models from DNA alone.
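The abstract reports transcript-level F1 and exact-match transcripts without defining how they are computed. As an illustration only, the sketch below shows one common way such a metric could be scored: a predicted transcript counts as correct only if its chromosome, strand, and every coding-exon boundary match a reference transcript exactly. The `Transcript` representation and the `transcript_f1` function are hypothetical names introduced here, and the exact-match criterion is an assumption rather than the authors' stated evaluation protocol.

```python
from typing import Iterable, Set, Tuple

# Hypothetical representation (an assumption, not from the paper):
# (chromosome, strand, ordered tuple of (start, end) coding-exon intervals).
Transcript = Tuple[str, str, Tuple[Tuple[int, int], ...]]


def transcript_f1(
    predicted: Iterable[Transcript],
    reference: Iterable[Transcript],
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact-match transcript scoring."""
    pred: Set[Transcript] = set(predicted)
    ref: Set[Transcript] = set(reference)

    # A true positive requires every exon boundary to agree exactly.
    true_pos = len(pred & ref)

    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: one prediction matches exactly, one has a shifted splice site.
reference = [
    ("chr1", "+", ((100, 200), (300, 400))),
    ("chr1", "-", ((1000, 1200),)),
]
predicted = [
    ("chr1", "+", ((100, 200), (300, 400))),  # exact match
    ("chr1", "-", ((1000, 1195),)),           # boundary off by 5 bp
]
precision, recall, f1 = transcript_f1(predicted, reference)
```

Under this strict criterion a single misplaced splice junction or stop codon turns an otherwise correct transcript into both a false positive and a false negative, which is why sharper boundary predictions feed directly into transcript-level scores.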