Evaluating Pretrained Protein Language Model Embeddings as Proxies for Functional Similarity
Robert Shaw, Samuel D Love, Claire D McWhite
Abstract
Protein language models (PLMs) have emerged as powerful tools for representing protein sequences. We explore whether embeddings (numeric vector representations) from pretrained PLMs can serve as direct numeric proxies for protein structure and function without additional training or fine-tuning. In a proof-of-concept study of 22 cross-species complementation triplets (a gold standard for functional similarity, in which genes from one species are tested for their ability to rescue gene deletions in another species), we find that ESM-C 600M amino acid embeddings, pooled into sliced-Wasserstein embeddings, achieved high discrimination of subtle functional differences. This pooling method uses optimal transport theory to capture distributional properties of a protein's amino acid embeddings by comparing them against a set of reference points. Although our limited sample size precludes definitive conclusions about whether PLM embeddings systematically outperform sequence-based methods in detecting protein functional similarity, these preliminary results demonstrate the potential of protein embeddings for functional analysis. Our exploratory analysis of orthology relationships suggests that embedding similarity may correlate with functional conservation: the least diverged ortholog showed the higher embedding similarity in approximately two-thirds of cases. In examining the Ortholog Conjecture, which holds that orthologs maintain greater functional similarity than paralogs at equivalent sequence divergence, we do not observe clear differences between the embedding similarities of one-to-one orthologs and those of inparalogs. Finally, we propose a hybrid approach that integrates PLMs with phylogenetic methods and leverages their complementary strengths: PLM-derived numeric embeddings for rapid homology detection and phylogenetics for evolutionary precision. We introduce discordance between embedding trees and gene trees as a potential metric for detecting functional divergence between closely related proteins.
Integrating protein embeddings with sequence analysis may enable a more nuanced understanding of protein function and evolutionary dynamics.
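To make the pooling step in the abstract concrete, the following is a minimal sketch of one common formulation of sliced-Wasserstein pooling, not necessarily the authors' implementation. It projects a protein's amino acid embeddings and a set of reference points onto several one-dimensional directions, sorts each projection to obtain empirical quantile functions, and records the difference between the two. The function names `swe_pool` and `cosine`, the random reference points and directions, the synthetic arrays standing in for ESM-C embeddings, and the use of cosine similarity to compare pooled vectors are all assumptions made for illustration.

```python
import numpy as np


def swe_pool(token_emb, reference, directions):
    """Pool an (n_residues, d) array into a fixed-length vector.

    reference:  (m, d) array of reference points (hypothetical choice)
    directions: (k, d) array of unit-norm projection directions ("slices")
    Returns a vector of length m * k, independent of n_residues.
    """
    n, m, k = token_emb.shape[0], reference.shape[0], directions.shape[0]
    # Project both point sets onto every slice and sort along the point axis;
    # each sorted column is an empirical quantile function in one dimension.
    proj_x = np.sort(token_emb @ directions.T, axis=0)  # (n, k)
    proj_r = np.sort(reference @ directions.T, axis=0)  # (m, k)
    # Resample the protein's quantile functions at the reference's m levels
    # so that proteins of different lengths yield vectors of the same size.
    q_src = (np.arange(n) + 0.5) / n
    q_ref = (np.arange(m) + 0.5) / m
    resampled = np.stack(
        [np.interp(q_ref, q_src, proj_x[:, j]) for j in range(k)], axis=1
    )  # (m, k)
    # The pooled embedding is the per-slice displacement from the reference.
    return (resampled - proj_r).ravel()


def cosine(a, b):
    """Cosine similarity between two pooled vectors (assumed comparison metric)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Synthetic stand-ins for per-residue PLM embeddings (not real ESM-C output).
rng = np.random.default_rng(0)
d = 8
directions = rng.normal(size=(32, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
reference = rng.normal(size=(16, d))

query = rng.normal(size=(50, d))        # gene whose deletion is to be rescued
candidate_a = rng.normal(size=(60, d))  # one candidate from another species
candidate_b = rng.normal(size=(40, d))  # a second candidate

pooled = [swe_pool(x, reference, directions) for x in (query, candidate_a, candidate_b)]
print("query vs candidate A:", cosine(pooled[0], pooled[1]))
print("query vs candidate B:", cosine(pooled[0], pooled[2]))
```

In a triplet comparison of the kind the abstract describes, the candidate with the higher similarity to the query would be the predicted better functional match. Because each projection is sorted before resampling, the pooled vector does not depend on residue order; it summarizes the distribution of amino acid embeddings rather than their positions.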