Because they pass or fail without any code changes, flaky tests cause serious problems, such as spuriously failing builds and eroding developers’ trust in test suites. Many previous evaluations of automated flaky test detection techniques do not accurately assess their usefulness for the developers who identify flaky tests in order to repair them. This is because researchers evaluate detection techniques against baselines that are not derived from past developer behavior, or against no baselines at all. To study the effectiveness of automated test rerunning, a common baseline for other detection approaches, this paper uses 75 commits — authored by human software developers — that repair test flakiness in 31 real-world Python projects. Surprisingly, automated rerunning detects the developer-repaired flaky tests in only 40% of the studied commits. This result suggests that automated rerunning rarely finds the flaky tests that developers actually fix, implying that it is an unsuitable baseline for assessing a detection technique’s usefulness to developers.
Parry, O., Kapfhammer, G. M., Hilton, M., & McMinn, P. (2022). What do developer-repaired flaky tests tell us about the effectiveness of automated flaky test detection? Proceedings of the 3rd International Conference on Automation of Software Test.
Want to cite this paper? Look in the BibTeX file of gkapfham/research-bibliography for the key "Parry2022c".