EpiCurveBench: Evaluating epidemic curve digitization.
Thomas Berkane, Maimuna S Majumder
Abstract
Accurate data on disease case counts over time is essential for training reliable disease forecasting models. However, such data is often locked in non-machine-readable formats, most commonly as epidemic curve (epicurve) images: charts that depict case counts of a given disease over time, for a given location. Digitizing these charts would greatly expand the data available for forecasting models, improving their accuracy. Manual digitization, though, is very time-consuming, and existing automated methods struggle with real-world epicurves due to dense datapoints, overlapping series, and varied visual styles. To address this, we present EpiCurveBench, a benchmark of 100 manually curated and annotated epicurve images collected from diverse sources. The dataset spans a wide range of chart styles, from simple to highly complex. We also introduce EpiCurve Similarity (ECS), a new evaluation metric that captures the temporal structure of epicurves, handles series of varying lengths, and remains stable in the presence of incomplete data. Using this metric, we evaluate state-of-the-art chart data extraction methods on EpiCurveBench and find substantial room for improvement, with the best model achieving an ECS of only 42.9%. We release the dataset and evaluation pipeline to accelerate progress in epicurve extraction. More broadly, the difficulty of EpiCurveBench compared to existing chart extraction benchmarks provides a rigorous testbed for advancing chart data extraction methods beyond disease forecasting.