SynVectorDB: embedding-based retrieval system for synthetic biology parts.
Hao Li, Jiani Hu, Jie Song, Wei Zhou
Abstract
Synthetic biology part discovery is hampered by inconsistent data organization and limited semantic search capabilities across existing repositories. We developed SynVectorDB, an embedding-based retrieval system that addresses these limitations through methodological innovations in data integration and AI-driven semantic search. The system integrates 19 850 biological parts from multiple sources (Addgene, iGEM Registry, and laboratory collections) under systematic curation protocols; 7656 of these parts achieved verified status through literature-based validation and reliability assessment. We introduce a novel three-level hierarchical classification system that organizes parts into functionally coherent categories (DNA Elements, RNA Elements, Coding Sequences, and Application Constructs) with detailed subcategorization. The core technical contribution uses BGE-M3 multilingual embeddings within a scalable vector database architecture to enable semantic similarity matching that significantly outperforms keyword-based retrieval methods. Standardized curation workflows improve data comparability and search accuracy across heterogeneous sources. The dual deployment architecture provides high performance through cloud services while preserving open-source accessibility and deployment flexibility, and the system remains compatible with SBOL3.
Database URL: SynVectorDB is available in multiple deployment modes: web interface (https://svdb.sjtu.bio), local installation and source code (https://github.com/AilurusBio/synbio-parts-db), and MCP server integration for AI assistants (https://www.npmjs.com/package/synvectordb).
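The semantic similarity matching described in the abstract can be illustrated with a minimal sketch. It is not the SynVectorDB implementation: the part identifiers, the three-dimensional toy vectors, and the function names `cosine_similarity` and `top_k` are all hypothetical. In the real system the vectors would come from the BGE-M3 embedding model and the ranking would be carried out by the vector database.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """Return the k (part_id, score) pairs most similar to query_vec."""
    scored = [(part_id, cosine_similarity(query_vec, vec))
              for part_id, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Hypothetical toy index: hand-made 3-d vectors standing in for embeddings.
PART_INDEX = {
    "promoter_A": [0.9, 0.1, 0.0],
    "terminator_B": [0.0, 0.2, 0.9],
    "promoter_C": [0.8, 0.3, 0.1],
}

# Hypothetical query vector, e.g. the embedding of a promoter-related query.
query = [1.0, 0.2, 0.0]
results = top_k(query, PART_INDEX, k=2)
for part_id, score in results:
    print(f"{part_id}: {score:.3f}")
```

Because ranking depends on vector proximity rather than shared tokens, a query can retrieve parts whose descriptions use different wording, which is the property that distinguishes this approach from keyword-based retrieval.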