PeptideMTR: Scaling SMILES-Based Language Models for Therapeutic Peptide Engineering.
Aaron L Feller, Maxim Secor, Sebastian Swanson, Claus O Wilke, Kristine Deibler
Abstract
Therapeutic peptides occupy a unique region of chemical space, combining the modularity of proteins with the versatility of small molecules. However, existing foundation models struggle to represent this domain: protein language models are confined to canonical residues, while small-molecule models often lack the contextual range required for peptide sequences. Here we introduce PeptideMTR, a suite of nine SMILES-based chemical language models (32M to 337M parameters) pretrained on peptide and small-molecule data with three objectives: masked language modeling, multi-task regression to physicochemical descriptors, and a combined objective. Systematic evaluation on membrane permeability prediction reveals a distinct scaling transition: at smaller scales, descriptor-guided pretraining provides a crucial inductive bias, grounding embeddings in physicochemical space; however, as capacity increases, purely self-supervised models recover equivalent predictive capability, suggesting that large models spontaneously internalize these physicochemical priors. PeptideMTR outperforms molecular fingerprints and specialized architectures on diverse peptide tasks including aggregation propensity, tumor homing, cell penetration, and antimicrobial activity. We release PeptideMTR as an open, scalable resource for therapeutic peptide engineering.
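To make the three pretraining objectives concrete, the following is a minimal sketch of how a masked language modeling loss, a multi-task descriptor regression loss, and their combination could be computed for a single tokenized SMILES string. It is an illustration only: the abstract does not specify the loss functions, the descriptor set, or how the two terms are weighted, so the cross-entropy and mean-squared-error forms, the function names, and the `weight` parameter are all assumptions.

```python
import math


def masked_lm_loss(logits, targets, mask):
    """Mean cross-entropy over masked token positions only (assumed MLM form).

    logits:  one list of vocabulary scores per token position
    targets: the true token index at each position
    mask:    True where the token was masked and must be predicted
    """
    total, count = 0.0, 0
    for row, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]
        count += 1
    return total / count if count else 0.0


def descriptor_regression_loss(predicted, reference):
    """Mean squared error across descriptor targets (assumed regression form)."""
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(reference)


def combined_loss(mlm, mtr, weight=1.0):
    """Weighted sum of the two objectives; the weighting is hypothetical."""
    return mlm + weight * mtr


# Toy example: two token positions over a two-symbol vocabulary,
# with only the first position masked, and two descriptor targets.
mlm = masked_lm_loss(logits=[[0.0, 0.0], [5.0, 0.0]], targets=[0, 0], mask=[True, False])
mtr = descriptor_regression_loss(predicted=[1.0, 2.0], reference=[1.0, 4.0])
total = combined_loss(mlm, mtr)
```

Under this reading, setting `weight` to zero recovers the purely self-supervised objective, and dropping the first term recovers the descriptor-only objective.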