Does automated test rerunning find the flaky tests that developers fix?
Introduction
As a software developer, you’ve likely encountered flaky tests that seem to pass and fail at random, disrupting the continuous integration process and wasting your valuable development time. While automated tools promise to detect these problematic tests, it is worth asking how well they actually find the flaky tests that developers care about fixing. My colleagues and I recently published a study (Parry et al. 2022) that investigates exactly this question.
Limitations of Current Methods
Most research on flaky test detection suffers from a fundamental evaluation problem. Researchers typically assess their techniques against baselines created by automated rerunning tools. However, these approaches don’t tell us whether the detected flaky tests are the ones that developers would actually spend time fixing in practice. Our study addresses this gap by using a baseline derived from real developer behavior: 75 commits in which developers repaired flaky tests across 31 real-world Python projects. This developer-based methodology may provide a more realistic assessment of the usefulness of flaky test detection.
Key Research Findings
Automated Rerunning Has Low Recall: When we applied automated test rerunning to the flaky tests that developers actually fixed, it detected them in only 40% of the studied commits. This surprisingly low recall suggests that automated rerunning often misses the flaky tests that matter most to the developers of open-source Python projects.
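To make this concrete, here is a minimal sketch of what rerun-based detection looks like in practice for a pytest project. The test identifier and rerun count are illustrative only, not taken from the paper.

```python
# Minimal sketch of rerun-based flaky test detection for a pytest project.
# The test identifier and rerun count below are illustrative, not from the paper.
import subprocess


def rerun_detect(test_id: str, reruns: int = 10) -> bool:
    """Return True if the test's verdict changes across reruns (i.e., it looks flaky)."""
    outcomes = set()
    for _ in range(reruns):
        result = subprocess.run(
            ["pytest", "-q", test_id],
            capture_output=True,
        )
        outcomes.add(result.returncode)  # 0 = pass, non-zero = fail/error
        if len(outcomes) > 1:
            return True  # inconsistent verdicts across identical reruns: flagged as flaky
    return False


if __name__ == "__main__":
    print(rerun_detect("tests/test_example.py::test_network_timeout"))
```

A detector like this can only flag a test that happens to flip within its rerun budget, which is one intuition for why plain rerunning misses so many developer-relevant flaky tests.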
Noise Improves Detection: Rerunning with environmental noise (e.g., randomizing thread priorities, test execution order, network speeds, and Python versions) performed significantly better than standard automated test rerunning, improving recall from 21% to 40%. This result underscores the value of running tests in diverse environments.
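The sketch below shows one way to add a little environmental noise to reruns. It assumes the pytest-randomly plugin is installed to shuffle test order; the paper’s noise also varied thread priorities, network speeds, and Python versions, which this sketch does not attempt to reproduce.

```python
# Hedged sketch: rerunning a pytest suite under simple environmental noise.
# Assumes the pytest-randomly plugin is installed (it shuffles test order per seed).
import os
import random
import subprocess


def rerun_with_noise(reruns: int = 10) -> set:
    """Rerun the suite with varied hash seeds and test order; return the exit codes seen."""
    exit_codes = set()
    for _ in range(reruns):
        env = dict(os.environ)
        # Perturb dict/set iteration order in the test process.
        env["PYTHONHASHSEED"] = str(random.randint(0, 2**32 - 1))
        result = subprocess.run(
            ["pytest", "-q", f"--randomly-seed={random.randint(0, 2**32 - 1)}"],
            env=env,
            capture_output=True,
        )
        exit_codes.add(result.returncode)
    return exit_codes  # more than one distinct exit code suggests flakiness


if __name__ == "__main__":
    print(rerun_with_noise())
```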
Common Causes and Repairs: Through manual analysis of the code changes in the 75 commits, we found that randomness was the most frequent cause of flaky tests, followed by asynchronous waiting issues and timing problems. The most common developer repair was widening assertions to accept a broader range of values.
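As a hedged illustration (not taken from the studied commits), here is the shape of a timing-sensitive flaky test and the kind of “widen the accepted range” repair the study found to be most common.

```python
# Illustrative example only: a timing-sensitive assertion and its typical repair.
import time


def do_work():
    time.sleep(0.1)  # stand-in for real work whose duration varies slightly


# Before: flaky, because the elapsed time is never exactly 0.1 seconds.
def test_work_duration_flaky():
    start = time.perf_counter()
    do_work()
    assert time.perf_counter() - start == 0.1


# After: the assertion accepts a broader range of values, so small,
# environment-dependent variations no longer cause spurious failures.
def test_work_duration_repaired():
    start = time.perf_counter()
    do_work()
    assert 0.05 <= time.perf_counter() - start <= 0.5
```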
Evaluation Methodology Matters: Our results demonstrate that using automated rerunning as a baseline for evaluating other detection techniques may not accurately reflect their usefulness for developers, since rerunning itself may not find developer-relevant flaky tests.
Important Implications
For developers considering automated flaky test detection tools, our findings suggest that current rerunning-based approaches have limited practical value. If you do use automated rerunning, include environmental noise to improve detection rates. More importantly, developers should not rely solely on automated tools that rerun tests, since manual inspection and developer intuition remain crucial for identifying problematic flaky tests.
For researchers developing new detection techniques, our study highlights the importance of evaluation methodology. Supplementing traditional rerun-based evaluations with developer-based baselines provides a more realistic assessment of practical usefulness.
Conclusion
This research reveals a significant gap between what automated tools detect and what developers actually need to fix. The 40% recall rate of automated rerunning against developer-repaired flaky tests indicates that current detection approaches may miss many practically important cases. Our developer-based evaluation methodology offers a more realistic way to assess flaky test detection techniques, focusing on tests that developers have demonstrated are worth their time to repair. Going forward, it is critical to understand, leverage, and enhance this developer perspective as we create effective solutions to the problem of flaky tests.
As my colleagues and I continue exploring the complexities of flaky test detection and developer workflows, your insights and experiences are invaluable! If you have thoughts about flaky test detection tools or experiences with developer-relevant testing challenges, please contact me. Furthermore, if you wish to stay informed about new developments and blog posts related to flaky tests and software testing research, consider subscribing to my mailing list. The intersection of developer behavior and automated testing tools remains a rich area for future investigation!