Does automated test rerunning find the flaky tests that developers fix?
Introduction
As a software developer, you’ve likely encountered flaky tests that seem to pass and fail at random, disrupting the continuous integration process and wasting your valuable development time. While automated tools promise to detect these problematic tests, it is worth asking how well they actually find the flaky tests that developers care about fixing. My colleagues and I recently published a study (Parry et al. 2022) that investigates exactly this question.
Limitations of Current Methods
Most research on flaky test detection suffers from a fundamental evaluation problem. Researchers typically assess their techniques against baselines created by automated rerunning tools. However, these approaches don’t tell us whether the detected flaky tests are the ones that developers would actually spend time fixing in practice. Our study addresses this gap by using a baseline derived from real developer behavior: 75 commits in which developers repaired flaky tests across 31 real-world Python projects. This developer-based methodology may provide a more realistic assessment of the usefulness of flaky test detection.
Key Research Findings
Automated Rerunning Has Low Recall: When we applied automated test rerunning to the flaky tests that developers actually fixed, it detected them in only 40% of the studied commits. This surprisingly low recall suggests that automated rerunning often misses the flaky tests that matter most to the developers of open-source Python projects.
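To make this concrete, here is a minimal sketch of what rerun-based detection looks like in practice for a pytest project. The test identifier and rerun count are illustrative only, not taken from the paper.

```python
# Minimal sketch of rerun-based flaky test detection for a pytest project.
# The test identifier and rerun count below are illustrative, not from the paper.
import subprocess


def rerun_detect(test_id: str, reruns: int = 10) -> bool:
    """Return True if the test's verdict changes across reruns (i.e., it looks flaky)."""
    outcomes = set()
    for _ in range(reruns):
        result = subprocess.run(
            ["pytest", "-q", test_id],
            capture_output=True,
        )
        outcomes.add(result.returncode)  # 0 = pass, non-zero = fail/error
        if len(outcomes) > 1:
            return True  # inconsistent verdicts across identical reruns: flagged as flaky
    return False


if __name__ == "__main__":
    print(rerun_detect("tests/test_example.py::test_network_timeout"))
```

A detector like this can only flag a test that happens to flip within its rerun budget, which is one intuition for why plain rerunning misses so many developer-relevant flaky tests.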
Noise Improves Detection: Rerunning with environmental noise (e.g., randomizing thread priorities, test execution order, network speeds, and Python versions) performed significantly better than standard automated test rerunning, improving recall from 21% to 40%. This result underscores the value of running tests in diverse environments.
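The sketch below shows one way to add a little environmental noise to reruns. It assumes the pytest-randomly plugin is installed to shuffle test order; the paper’s noise also varied thread priorities, network speeds, and Python versions, which this sketch does not attempt to reproduce.

```python
# Hedged sketch: rerunning a pytest suite under simple environmental noise.
# Assumes the pytest-randomly plugin is installed (it shuffles test order per seed).
import os
import random
import subprocess


def rerun_with_noise(reruns: int = 10) -> set:
    """Rerun the suite with varied hash seeds and test order; return the exit codes seen."""
    exit_codes = set()
    for _ in range(reruns):
        env = dict(os.environ)
        # Perturb dict/set iteration order in the test process.
        env["PYTHONHASHSEED"] = str(random.randint(0, 2**32 - 1))
        result = subprocess.run(
            ["pytest", "-q", f"--randomly-seed={random.randint(0, 2**32 - 1)}"],
            env=env,
            capture_output=True,
        )
        exit_codes.add(result.returncode)
    return exit_codes  # more than one distinct exit code suggests flakiness


if __name__ == "__main__":
    print(rerun_with_noise())
```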
Common Causes and Repairs: Through manual analysis of the code changes in the 75 commits, we found that randomness was the most frequent cause of flaky tests, followed by asynchronous waiting issues and timing problems. The most common developer repair was widening assertions to accept a broader range of values.
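As a hedged illustration (not taken from the studied commits), here is the shape of a timing-sensitive flaky test and the kind of “widen the accepted range” repair the study found to be most common.

```python
# Illustrative example only: a timing-sensitive assertion and its typical repair.
import time


def do_work():
    time.sleep(0.1)  # stand-in for real work whose duration varies slightly


# Before: flaky, because the elapsed time is never exactly 0.1 seconds.
def test_work_duration_flaky():
    start = time.perf_counter()
    do_work()
    assert time.perf_counter() - start == 0.1


# After: the assertion accepts a broader range of values, so small,
# environment-dependent variations no longer cause spurious failures.
def test_work_duration_repaired():
    start = time.perf_counter()
    do_work()
    assert 0.05 <= time.perf_counter() - start <= 0.5
```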
Evaluation Methodology Matters: Our results demonstrate that using automated rerunning as a baseline for evaluating other detection techniques may not accurately reflect their usefulness for developers, since rerunning itself may not find developer-relevant flaky tests.
Important Implications
For developers considering automated flaky test detection tools, our findings suggest that current rerunning-based approaches have limited practical value. If you do use automated rerunning, include environmental noise to improve detection rates. More importantly, developers should not rely solely on automated tools that rerun tests, since manual inspection and developer intuition remain crucial for identifying problematic flaky tests.
For researchers developing new detection techniques, our study highlights the importance of evaluation methodology. Supplementing traditional rerun-based evaluations with developer-based baselines provides a more realistic assessment of practical usefulness.
Conclusion
This research reveals a significant gap between what automated tools detect and what developers actually need to fix. The 40% recall rate of automated rerunning against developer-repaired flaky tests indicates that current detection approaches may miss many practically important cases. Our developer-based evaluation methodology offers a more realistic way to assess flaky test detection techniques, focusing on tests that developers have demonstrated are worth their time to repair. Going forward, it is critical to understand, leverage, and enhance this developer perspective as we create effective solutions to the problem of flaky tests.
As my colleagues and I continue exploring the complexities of flaky test detection and developer workflows, your insights and experiences are invaluable! If you have thoughts about flaky test detection tools or experiences with developer-relevant testing challenges, please contact me. Furthermore, if you wish to stay informed about new developments and blog posts related to flaky tests and software testing research, consider subscribing to my mailing list. The intersection of developer behavior and automated testing tools remains a rich area for future investigation!