Enhancing molecular property prediction through data integration and consistency assessment.
Raquel Parrondo-Pizarro, Luca Menestrina, Ricard Garcia-Serna, Adrià Fernández-Torras, Jordi Mestres
Abstract
Open AccessData heterogeneity and distributional misalignments pose critical challenges for machine learning models, often compromising predictive accuracy. These challenges are exemplified in preclinical safety modeling, a crucial step in early-stage drug discovery where limited data and experimental constraints exacerbate integration issues. Analyzing public ADME datasets, we uncovered significant misalignments as well as inconsistent property annotations between gold-standard and popular benchmark sources, such as Therapeutic Data Commons. These dataset discrepancies, which can arise from differences in various factors, including experimental conditions in data collection as well as chemical space coverage, can introduce noise and ultimately degrade model performance. Data standardization, despite harmonizing discrepancies and increasing the training set size, may not always lead to an improvement in predictive performance. This highlights the importance of rigorous data consistency assessment (DCA) prior to modeling. To facilitate a systematic DCA across diverse datasets, we developed AssayInspector, a model-agnostic package that leverages statistics, visualizations, and diagnostic summaries to identify outliers, batch effects, and discrepancies. Beyond preclinical safety, DCA can play a crucial role in federated learning scenarios, enabling effective transfer learning across heterogeneous data sources and supporting reliable integration across diverse scientific domains.