SARST2 high-throughput and resource-efficient protein structure alignment against massive databases.
Wei-Cheng Lo, Arieh Warshel, Chia-Hua Lo, Chia Yee Choke, Yan-Jie Li, Shih-Chung Yen, Jyun-Yi Yang, Shih-Wen Weng
Abstract
The flood of protein structural Big Data is coming. With the belief that biotech researchers deserve powerful analysis engines to overcome the challenge of rapidly increasing computational demands, we are devoted to developing efficient protein structural alignment search algorithms to assist researchers as they push the frontiers of biological sciences and technology. Here, we present SARST2, an algorithm that integrates primary, secondary, and tertiary structural features with evolutionary statistics to perform accurate and rapid alignments. In large-scale benchmarks, SARST2 outperforms state-of-the-art methods in accuracy, while completing AlphaFold Database searches significantly faster and with substantially less memory than BLAST and Foldseek. It employs a filter-and-refine strategy enhanced by machine learning, a diagonal shortcut for word-matching, a weighted contact number-based scoring scheme, and a variable gap penalty based on substitution entropy. SARST2, implemented in Golang as standalone programs available at https://10lab.ceb.nycu.edu.tw/sarst2 and https://github.com/NYCU-10lab/sarst, enables massive database searches using even ordinary personal computers.
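As an illustration of one quantity named above, the following is a minimal Go sketch of the weighted contact number (WCN) in its commonly used form, where residue i receives the sum of 1/r_ij^2 over all other residues j. It is not taken from the SARST2 source code. The abstract does not say how SARST2 computes WCN or turns it into a score, so the use of C-alpha coordinates, the `Point` type, and the `WeightedContactNumbers` function are assumptions made for this example only.

```go
package main

import "fmt"

// Point is a hypothetical 3D coordinate for one residue's C-alpha atom,
// in angstroms. It is an illustrative type, not part of SARST2.
type Point struct{ X, Y, Z float64 }

// WeightedContactNumbers returns, for each residue i, the sum over all
// other residues j of 1/r_ij^2, where r_ij is the distance between them.
// Each pair is visited once and credited to both residues.
func WeightedContactNumbers(ca []Point) []float64 {
	wcn := make([]float64, len(ca))
	for i := 0; i < len(ca); i++ {
		for j := i + 1; j < len(ca); j++ {
			dx := ca[i].X - ca[j].X
			dy := ca[i].Y - ca[j].Y
			dz := ca[i].Z - ca[j].Z
			d2 := dx*dx + dy*dy + dz*dz
			if d2 == 0 {
				continue // skip coincident atoms to avoid division by zero
			}
			w := 1 / d2
			wcn[i] += w
			wcn[j] += w
		}
	}
	return wcn
}

func main() {
	// Toy input: three collinear points spaced 2 angstroms apart.
	ca := []Point{{0, 0, 0}, {2, 0, 0}, {4, 0, 0}}
	for i, v := range WeightedContactNumbers(ca) {
		fmt.Printf("residue %d: WCN = %.4f\n", i, v)
	}
}
```

In this toy input the middle point has the largest value because it is close to both neighbours, which is the sense in which WCN reflects how densely a residue is packed. The pairwise loop is quadratic in chain length, which is acceptable for a sketch but says nothing about how SARST2 itself is optimized.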