Exploring non-target screening variability in unsupervised multivariate time trend analysis of LC-HRMS data.
Reyhaneh Armin, Maryam Vosough, Torsten C Schmidt
Abstract
Non-target screening (NTS) using liquid chromatography-high-resolution mass spectrometry (LC-HRMS) has become essential for uncovering unknown contaminants in complex matrices such as industrial wastewater. A key goal in these applications is detecting and interpreting time trends, especially spill-related events. The clarity of multivariate models depends on the quality of the feature lists produced by the peak picking software. In this study, we evaluated five peak picking tools (MarkerView, MZmine3, XCMS, OpenMS, and SIRIUS) for unsupervised time trend exploration using sparse principal component analysis (SPCA). SPCA selects the most informative features per component, improving interpretability and reducing confounding variables. Two datasets were used: a controlled validation set of pooled wastewater samples spiked with target compounds exhibiting known profiles, and a real-world dataset comprising 52 consecutive daily industrial wastewater samples. The first dataset was used to examine the tuning parameters: SPCA distinguished the spiking patterns associated with individual components and highlighted differences in how the tools prioritized features and artifacts. XCMS, MZmine3, and OpenMS showed higher consistency and were selected for the subsequent analysis. In the second phase, SPCA was combined with stratified bootstrapping (SBS-SPCA) to assess the reliability of trend detection, exemplified by specific targets. Five of the nine markers, which showed greater temporal persistence, were robustly detected across tools (selection frequency > 70%) under optimized tuning conditions. These findings indicate that interpretable, sparse models enhance marker detection in unsupervised settings and shed light on how software-driven feature structures affect multivariate outcomes in time-series NTS data. Such insights are especially pertinent for future high-throughput applications involving temporally dynamic exposure scenarios.
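The idea behind SBS-SPCA, resampling the samples within strata, refitting a sparse component, and counting how often each feature is selected, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the sparse component is obtained with a simple thresholded power iteration rather than the SPCA algorithm used in the study, the strata are arbitrary 13-day blocks, the data are synthetic (two features carrying a spill-like step, four noise features), and the names `sparse_pc1`, `stratified_bootstrap`, `selection_frequency`, and `make_demo` are hypothetical.

```python
import random


def center_columns(X):
    """Subtract each column mean so loadings describe variance, not offsets."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]


def sparse_pc1(X, keep=2, n_iter=20):
    """First sparse loading vector via thresholded power iteration.

    Stand-in for SPCA (assumption): after each multiplication by X^T X,
    the loadings are soft-thresholded so at most `keep` stay nonzero.
    """
    Xc = center_columns(X)
    n, p = len(Xc), len(Xc[0])
    v = [1.0 / p ** 0.5] * p
    for _ in range(n_iter):
        scores = [sum(row[j] * v[j] for j in range(p)) for row in Xc]
        w = [sum(Xc[i][j] * scores[i] for i in range(n)) for j in range(p)]
        # Cutoff is the (keep+1)-th largest magnitude, so only `keep` survive.
        cutoff = sorted((abs(x) for x in w), reverse=True)[keep] if keep < p else 0.0
        w = [(abs(x) - cutoff) * (1.0 if x > 0 else -1.0) if abs(x) > cutoff else 0.0
             for x in w]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            return [0.0] * p  # degenerate resample: nothing selected
        v = [x / norm for x in w]
    return v


def stratified_bootstrap(strata, rng):
    """Resample sample indices with replacement separately within each stratum."""
    groups = {}
    for i, s in enumerate(strata):
        groups.setdefault(s, []).append(i)
    idx = []
    for members in groups.values():
        idx.extend(rng.choice(members) for _ in members)
    return idx


def selection_frequency(X, strata, keep=2, n_boot=100, seed=0):
    """Fraction of bootstrap resamples in which each feature has a nonzero loading."""
    rng = random.Random(seed)
    p = len(X[0])
    counts = [0] * p
    for _ in range(n_boot):
        idx = stratified_bootstrap(strata, rng)
        v = sparse_pc1([X[i] for i in idx], keep=keep)
        for j in range(p):
            if v[j] != 0.0:
                counts[j] += 1
    return [c / n_boot for c in counts]


def make_demo(seed=1):
    """Synthetic 52-day series: features 0 and 1 share a spill-like step (days 20-29)."""
    rng = random.Random(seed)
    X = []
    for day in range(52):
        spill = 10.0 if 20 <= day < 30 else 0.0
        row = [spill + rng.gauss(0, 1), 0.8 * spill + rng.gauss(0, 1)]
        row += [rng.gauss(0, 1) for _ in range(4)]
        X.append(row)
    strata = [day // 13 for day in range(52)]  # hypothetical 13-day blocks
    return X, strata


if __name__ == "__main__":
    X, strata = make_demo()
    freq = selection_frequency(X, strata, keep=2)
    robust = [j for j, f in enumerate(freq) if f > 0.7]  # 70% cutoff, as in the abstract
    print("selection frequencies:", freq)
    print("features above 70%:", robust)
```

In this sketch the `keep` argument plays the role of the sparsity tuning parameter: a smaller value yields fewer selected features per component, and the selection frequency then indicates which of those features are stable under resampling rather than dependent on a few individual samples.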