Don't Stop Me Now, 'Cause I'm Having a Good Time Screening: Evaluation of Stopping Methods for Safe Use of Priority Screening in Systematic Reviews.
Tim Repke, Francesca Tinsdeall, Diana Danilenko, Sergio Graziosi, Finn Müller-Hansen, Lena Schmidt, James Thomas, Gert van Valkenhoef
Abstract
Introduction: Priority screening has the potential to reduce the number of records that need to be annotated in systematic literature reviews. So-called technology-assisted reviews (TAR) use machine learning trained on prior include/exclude annotations to continuously rank unseen records by their predicted relevance, so that relevant records are found earlier. In this article, we present a systematic evaluation of methods to determine when it is safe to stop screening when using prioritization. Methods: We implement an open-source evaluation framework that features a novel method to generate rankings and simulate priority screening processes for 81 real-world data sets. We use these simulations to evaluate 15 statistical or rule-based (heuristic) stopping methods, testing a range of hyperparameters for each. Results: The work-saving potential and performance of stopping criteria rely heavily on "good" rankings, which are typically not achieved by a single ranking algorithm across the entire screening process. Our evaluation shows that almost all existing stopping methods either fail to reliably stop without missing relevant records or fail to utilize the full potential work savings. Only one method reliably meets the set recall target, but it stops conservatively. Conclusions: Many digital evidence synthesis tools provide priority screening features that are already used in many research projects. However, the theoretical work savings demonstrated in retrospective simulations of prioritization can only be unlocked with safe and reproducible stopping criteria. Our results highlight the need for improved stopping methods and guidelines on how to responsibly use priority screening. We also urge screening platforms to provide indicators and authors to transparently report metrics when automating (parts of) their synthesis.
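To make the trade-off between recall and work savings concrete, the following is a minimal sketch of how a rule-based stopping criterion could be simulated on an already ranked list of records. It is an illustration only and not the evaluation framework described above: the function names, the `patience` parameter, and the toy label sequence are all hypothetical, and the "stop after a run of consecutive excludes" rule is a generic heuristic that is not claimed to be one of the 15 evaluated methods.

```python
from typing import List, Optional, Tuple


def stop_after_consecutive_excludes(labels: List[int], patience: int) -> Optional[int]:
    """Hypothetical heuristic: stop once `patience` excludes occur in a row.

    `labels` holds include (1) / exclude (0) decisions in screening order.
    Returns the number of records screened when the rule fires, or None
    if it never fires.
    """
    run = 0
    for i, label in enumerate(labels):
        run = 0 if label == 1 else run + 1
        if run >= patience:
            return i + 1
    return None


def evaluate_stop(labels: List[int], stop_at: Optional[int]) -> Tuple[float, float]:
    """Return (recall, work saved) when screening stops after `stop_at` records."""
    n = len(labels)
    screened = n if stop_at is None else stop_at
    total_relevant = sum(labels)
    found = sum(labels[:screened])
    recall = found / total_relevant if total_relevant else 1.0
    work_saved = 1 - screened / n
    return recall, work_saved


# Toy ranking (made-up data): most relevant records appear early, one appears late.
ranked_labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

stop_at = stop_after_consecutive_excludes(ranked_labels, patience=5)
recall, work_saved = evaluate_stop(ranked_labels, stop_at)
print(f"stopped after {stop_at} records, recall={recall:.2f}, work saved={work_saved:.2f}")
```

In this toy example the rule fires before the last relevant record is reached, so some work is saved at the cost of missing a record. A larger `patience` would recover it but save less work, which is the tension between reliability and work savings that the abstract describes.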