DeepCOI: a large language model-driven framework for fast and accurate taxonomic assignment in animal metabarcoding.
Ho-Jin Gwak, Mina Rho
Abstract
Metabarcoding remains challenging due to incomplete taxonomic annotations and computationally intensive processes. We present DeepCOI, a large language model-based classifier pre-trained on seven million cytochrome c oxidase I gene sequences. DeepCOI enables fast and accurate taxonomic assignment across eight major phyla, achieving an AU-ROC of 0.958 and an AU-PR of 0.897, outperforming existing methods while significantly reducing inference time. Additionally, DeepCOI demonstrates interpretability by identifying taxonomically informative sequence positions. By integrating large-scale datasets and self-supervised learning, DeepCOI enhances both the accuracy and efficiency of metabarcoding processes, providing a scalable solution for biodiversity assessment and environmental monitoring.
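For readers less familiar with the two reported metrics, the following is a minimal sketch of how AU-ROC and AU-PR are commonly computed for a single binary label. It is not the authors' evaluation code. The function names `au_roc` and `average_precision` and the toy labels and scores are illustrative assumptions, and the abstract does not state how scores are aggregated across taxa or taxonomic ranks.

```python
from typing import Sequence


def au_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AU-ROC as the probability that a random positive outscores a random negative.

    Tied scores count as half a win (the Mann-Whitney U formulation).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise AU-PR: mean precision measured at the rank of each positive.

    Examples are ranked by descending score; tied scores keep their input order.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive example")
    return precision_sum / hits


# Hypothetical toy data: 1 = sequence belongs to the taxon, 0 = it does not.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
roc = au_roc(toy_labels, toy_scores)
ap = average_precision(toy_labels, toy_scores)
```

AU-ROC summarizes how well positives are ranked above negatives regardless of class balance, whereas AU-PR is more sensitive to performance on rare classes, which is why the two are often reported together.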