Communications Chemistry

A large language model for deriving spectral embeddings for accurate compound identification in mass spectrometry.

Yang Xu, Yixiao Ma, Weijie Xu, Zuliang Yang, Kai Ming Ting

Published: 2025

DOI: 10.1038/s42004-025-01708-7

Abstract

Open Access

Identifying chemical components in complex mixtures is a crucial task across many scientific disciplines. Mass spectrometry serves as a key analytical tool for this purpose, yet the accurate identification of compounds from their spectra remains a major bottleneck. Here we introduce LLM4MS, a method that leverages the latent expert knowledge within large language models to generate discriminative spectral embeddings for improved compound identification. LLM4MS is designed to incorporate potential chemical expert knowledge, enabling accurate matching. Evaluated against a million-scale open-source in-silico library using the NIST23 library as a test set, LLM4MS achieves a Recall@1 accuracy of 66.3% (and a Recall@10 accuracy of 92.7%), representing a 13.7% improvement over the state-of-the-art Spec2Vec. Furthermore, LLM4MS enables ultra-fast mass spectra matching, achieving a speed of nearly 15,000 queries per second. Thus, LLM4MS opens up avenues to significantly enhance compound identification in mass spectrometry and accelerate chemical discovery.
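To make the Recall@1 and Recall@10 figures concrete, the following is a minimal sketch of how Recall@k could be computed once query and library spectra have been mapped to embedding vectors. It is not the LLM4MS implementation: the function name `recall_at_k`, the use of cosine similarity as the matching score, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import numpy as np


def recall_at_k(query_emb, library_emb, true_idx, k):
    """Fraction of queries whose correct library entry is among the top-k matches.

    Hypothetical helper: cosine similarity is assumed as the matching score.
    """
    # L2-normalise rows so that a dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    lib = library_emb / np.linalg.norm(library_emb, axis=1, keepdims=True)

    # Similarity matrix of shape (n_queries, n_library).
    sims = q @ lib.T

    # Indices of the k most similar library entries for each query.
    top_k = np.argsort(-sims, axis=1)[:, :k]

    # A query is a hit if its true library index appears in its top-k list.
    hits = (top_k == np.asarray(true_idx)[:, None]).any(axis=1)
    return float(hits.mean())


# Toy data (illustrative only): three library embeddings, two queries.
library = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
queries = np.array([[0.9, 0.1], [0.6, 0.5]])
truth = [0, 1]  # index of the correct library entry for each query

r1 = recall_at_k(queries, library, truth, k=1)
r3 = recall_at_k(queries, library, truth, k=3)
print(r1, r3)
```

In this toy case the first query ranks its correct entry first, while the second query ranks its correct entry last, so Recall@1 is 0.5 and Recall@3 is 1.0. For a million-scale library, an exhaustive `argsort` of this kind would likely be replaced by a partial sort or an approximate nearest-neighbour index.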
