Where tests fall short: Empirically analyzing oracle gaps in covered code

empirical study

mutation testing

software testing

Proceedings of the 19th International Symposium on Empirical Software Engineering and Measurement

Authors

Megan Maton

Gregory M. Kapfhammer

Phil McMinn

Published

2025

Abstract

Background: Developers often rely on statement coverage to assess test suite quality. However, statement coverage alone may lead to only 10% fault detection, necessitating more rigorous approaches. While mutation testing is effective, its execution and human analysis costs remain high. Identifying covered statements that are not checked by oracles (e.g., assertions) offers a cost-effective alternative; however, the lack of empirical evidence for selecting the appropriate Oracle Gap Calculation Approach (OGCA) prevents developers from making informed choices.

Aims: This knowledge-seeking study compares oracle gap characteristics determined by different OGCAs to assist developers in choosing the most valuable approach for their use cases.

Method: Using mixed-method empirical analysis, we conduct an in-depth evaluation of the oracle gaps produced using three OGCAs: Checked Coverage using a Dynamic Slicer (CCDS), Checked Coverage using an Observational Slicer (CCOS), and Pseudo-Tested Statement Identification (PTSI). Across 30 Java classes from six open-source projects, we report on a quantitative evaluation of gap prominence, distribution, fault detection correlation and execution times, as well as results from a qualitative manual inspection of the statement types found in the oracle gaps.

Results: The qualitative analysis showed data-loading statements, iteration statements and output updates to be most prominent in the oracle gaps. PTSI identified the oracle gaps with the lowest median mutation score (0.32), highlighting areas requiring more fault detection improvement compared to CCDS (0.76) and CCOS (0.50). PTSI also had the shortest median execution time (19.9 seconds), far quicker than both CCDS (273.2 seconds) and CCOS (5957.1 seconds).

Conclusions: PTSI quickly reveals the priority testing areas for improved fault detection, making it an effective OGCA for developers to identify where tests fall short.

Details

Paper
PseudoTested/esem-2025-replication-package
PseudoTested/PseudoSweep

Reference

@inproceedings{Maton2025,
 author = {Megan Maton and Gregory M. Kapfhammer and Phil McMinn},
 booktitle = {Proceedings of the 19th International Symposium on Empirical Software Engineering and Measurement},
 title = {Where tests fall short: Empirically analyzing oracle gaps in covered code},
 year = {2025}
}
