Significance, Fragility, and Robustness in Clinical Trials: Stratifying Statistical Evidence.
Thomas F Heston
Abstract
Background Current reporting standards treat p-values, effect sizes, and confidence intervals as complete evidence, but these measures are only partial: they quantify significance and magnitude, not classification stability (fragility) or distance from therapeutic neutrality (robustness). This study validates the p-fr-nb framework in two-arm, binary-outcome clinical trials. Framework extensions to continuous, ordinal, survival, and correlation analyses exist but are not empirically validated here. For this study, the p-fr-nb triplet is defined as the p-value (significance), fragility (classification stability), and robustness (distance from neutrality) of a trial result. The triplet assesses completeness across three dimensions of statistical inference; it stratifies evidence quality but does not prove truth, causality, or replication.
Methodology A pragmatic observational validation study was conducted of two-arm, binary-outcome clinical trials identified in PubMed (n = 129 across 15 specialties). Null expectations were generated with a Monte Carlo simulation of 720,000 trials across 360 design scenarios, including 120,000 null trials (true relative risk (RR) = 1.0). The simulations represent unfiltered random trial generation and do not model publication bias or selective reporting. Fragility was measured by the modified-arm fragility quotient (MFQ; the fragility index divided by the size of the modified arm). Robustness was measured by the risk quotient (RQ), defined for 2×2 tables as RQ = |ad - bc| / (N²/4), where a, b, c, and d are the cell counts and N is the total sample size. Concordant-positive (CP) evidence was defined as p ≤ 0.05, MFQ > 0.10, and RQ ≥ 0.227. The significant-fragile-weak (SFW) pattern was defined as p ≤ 0.05, MFQ ≤ 0.10, and RQ < 0.075. The RQ cutoffs were derived from large-scale simulation. The main outcomes were the rates of CP and SFW among statistically significant empirical trials, compared with null-simulation expectations.
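The triplet described in the Methodology can be sketched in code. The following is a minimal illustration, not the author's implementation. It assumes a two-sided Fisher exact test for the p-value and assumes that the modified arm is the one with the lower event proportion, with non-events flipped to events one at a time until significance is lost; the abstract specifies neither choice. The function names and the "intermediate" label for significant trials that are neither CP nor SFW are hypothetical.

```python
from math import comb


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    n, row1, col1 = a + b + c + d, a + b, a + c

    def prob(x):
        # Hypergeometric probability of x events in arm 1 given fixed margins.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    support = range(max(0, row1 + col1 - n), min(row1, col1) + 1)
    # Sum every table that is no more probable than the observed one.
    total = sum(px for px in map(prob, support) if px <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def risk_quotient(a, b, c, d):
    """RQ = |ad - bc| / (N^2 / 4), as defined in the abstract."""
    n = a + b + c + d
    return abs(a * d - b * c) / (n ** 2 / 4)


def modified_arm_fragility_quotient(a, b, c, d, alpha=0.05):
    """Fragility index divided by the size of the modified arm.

    Assumption: the arm with the lower event proportion is modified.
    Returns None when the trial is not significant at alpha.
    """
    if fisher_two_sided(a, b, c, d) > alpha:
        return None
    modify_first = a * (c + d) <= c * (a + b)
    if not modify_first:
        # Swap arms so that the modified arm is always (a, b).
        a, b, c, d = c, d, a, b
    arm_size = a + b
    flips = 0
    while b > 0 and fisher_two_sided(a, b, c, d) <= alpha:
        a, b = a + 1, b - 1  # flip one non-event to an event
        flips += 1
    return flips / arm_size


def classify(p, mfq, rq, alpha=0.05):
    """Apply the CP and SFW thresholds stated in the abstract."""
    if p > alpha:
        return "not significant"
    if mfq > 0.10 and rq >= 0.227:
        return "CP"
    if mfq <= 0.10 and rq < 0.075:
        return "SFW"
    return "intermediate"  # hypothetical label, not from the abstract


# Illustrative (made-up) trial: 10/100 events versus 30/100 events.
table = (10, 90, 30, 70)
p_value = fisher_two_sided(*table)
mfq = modified_arm_fragility_quotient(*table)
rq = risk_quotient(*table)
print(classify(p_value, mfq, rq))
```

For this made-up table, RQ = |10·70 - 90·30| / (200²/4) = 0.2, which falls between the two RQ cutoffs, so the trial is significant but meets neither the CP nor the SFW definition.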
Results In null simulations (RR = 1.0), the CP triplet occurred in 1.4% of significant trials; even with strong effects (RR = 0.60), it appeared in only 4.7%. Of the 129 trials analyzed, 77 (59.7%) were statistically significant. Among these 77 trials, 30 (39.0%; 95% confidence interval = 28.8-50.1%) met the CP criteria, a rate 27.9-fold higher than the null expectation (p < 0.0001). Overall, 61.0% of significant trials were fragile (MFQ ≤ 0.10, 47/77), 31.2% were weakly robust (RQ < 0.075, 24/77), and 31.2% showed the SFW pattern (24/77).
Conclusions In this heterogeneous sample, the p-fr-nb framework stratified positive findings beyond what p-values and confidence intervals reveal. Among 77 significant trials, 39.0% met stringent criteria for stability and strong robustness, a distinction not visible from p-values alone. Conversely, 31.2% showed the SFW pattern, in which significance was fragile and separation from no effect was minimal. Fragility and robustness metrics provide interpretive dimensions that p-values do not capture, enhancing the assessment of evidence heterogeneity relevant to reproducibility and clinical interpretation. These data support further evaluation of incorporating fragility and robustness metrics into the reporting of clinical trial results.
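The headline proportions in the Results can be checked with a short calculation. The sketch below assumes a Wilson score interval for the CP proportion; the abstract does not name the interval method, so this is an assumption, although it yields bounds close to the reported 28.8-50.1%. The fold difference is computed from the rounded percentages as reported (39.0% and 1.4%). The function name is hypothetical.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (assumed method)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


# 30 of 77 significant trials met the CP criteria.
cp_rate = 30 / 77
low, high = wilson_interval(30, 77)

# Fold difference versus the null-simulation CP rate, from rounded percentages.
fold = 39.0 / 1.4

print(f"CP rate: {cp_rate:.1%}, interval: {low:.1%} to {high:.1%}")
print(f"Fold versus null: {fold:.1f}")
```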