CANNIER gives software developers a best-of-both-worlds approach to flaky test detection
Introduction
Have you ever been stuck between the proverbial “rock and a hard place” when it comes to flaky test detection? On the one hand, you can repeatedly rerun your tests, which is accurate but can take an enormous amount of time and computational resources. On the other hand, you can use a machine learning model to predict flaky tests, which is fast but often not accurate enough to be reliable. This trade-off between prediction accuracy and speed has been a long-standing challenge for software developers who tackle flaky tests.
In a recent paper, my co-authors and I introduced CANNIER, an approach that offers a “best-of-both-worlds” solution to the problem of flaky test detection. The paper is titled “Empirically Evaluating Flaky Test Detection Techniques Combining Test Case Rerunning and Machine Learning Models” (Parry et al. 2023).
Key Contributions
Our paper makes the following key contributions to the field of flaky test detection:
CANNIER Approach: A novel technique that uses machine learning models as a heuristic to significantly reduce the problem space for rerunning-based flaky test detection. For many projects, CANNIER leads to a dramatic reduction in the time cost of detection with only a minimal decrease in performance.
Extensive Tooling: The lead author developed and released a comprehensive framework of automated tools to facilitate the replication of our results and to empower other researchers to build upon our work in hybrid flaky test detection.
Comprehensive Empirical Evaluation: A study involving nearly 90,000 test cases from 30 Python projects that not only demonstrates the effectiveness of CANNIER but also reveals new insights into machine learning-based flaky test detection.
Public Dataset: We have made our entire dataset publicly available to foster further research and innovation in the flaky test detection community.
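To make the idea of using a model as a heuristic concrete, here is a minimal Python sketch rather than the released tooling. It assumes a two-threshold scheme: tests that the model scores as confidently non-flaky or confidently flaky are labelled directly, and only the ambiguous tests in between are rerun. The function name hybrid_detect, the callbacks predict_proba and rerun_is_flaky, and the threshold values 0.2 and 0.8 are all hypothetical and chosen for illustration only.

```python
from typing import Callable, Dict, Iterable, Tuple


def hybrid_detect(
    tests: Iterable[str],
    predict_proba: Callable[[str], float],
    rerun_is_flaky: Callable[[str], bool],
    lower: float = 0.2,  # hypothetical threshold, not taken from the paper
    upper: float = 0.8,  # hypothetical threshold, not taken from the paper
) -> Tuple[Dict[str, bool], int]:
    """Label each test as flaky (True) or not (False).

    The model decides the confident cases; only ambiguous tests are rerun.
    Returns the labels and the number of tests that had to be rerun.
    """
    labels: Dict[str, bool] = {}
    reruns = 0
    for test in tests:
        p = predict_proba(test)
        if p < lower:
            labels[test] = False  # model is confident: not flaky, skip reruns
        elif p > upper:
            labels[test] = True  # model is confident: flaky, skip reruns
        else:
            labels[test] = rerun_is_flaky(test)  # ambiguous: pay for reruns
            reruns += 1
    return labels, reruns


if __name__ == "__main__":
    # Made-up model scores and a stand-in for an expensive rerunning step.
    scores = {"test_a": 0.05, "test_b": 0.50, "test_c": 0.95}
    flaky_when_rerun = {"test_b"}
    labels, reruns = hybrid_detect(
        scores, lambda t: scores[t], lambda t: t in flaky_when_rerun
    )
    print(labels, "tests rerun:", reruns)
```

In this toy run only one of the three tests is rerun. The time saving comes from how many tests fall outside the ambiguous band, and the accuracy cost comes from how often the model is wrong about the tests it labels directly.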
Empirical Results
Our empirical evaluation shows that CANNIER can decrease the time cost of rerunning-based flaky test detection by an average of 88% across three different techniques, at the cost of only a slight drop in accuracy. For instance, applying CANNIER to the rerunning technique reduced its time cost by 89% while maintaining a high Matthews correlation coefficient (MCC) of 0.92 between the technique’s flakiness predictions and the actual flakiness labels. This result demonstrates that CANNIER is a practical and effective solution for developers struggling with the high cost of flaky test detection. Please read the paper for its many other empirical findings!
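For readers unfamiliar with the two metrics above, the following Python sketch shows how they can be computed. The MCC formula is the standard one, defined over the counts of true positives, true negatives, false positives, and false negatives. The time_saving helper assumes that a reduction in time cost means the relative decrease against the baseline, which may differ from the exact definition used in the paper. The counts and durations in the example are made up for illustration and are not results from our study.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) to +1 (perfect prediction).
    Returns 0.0 when the denominator is zero, a common convention.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


def time_saving(baseline_seconds: float, hybrid_seconds: float) -> float:
    """Relative reduction in time cost (an assumed definition)."""
    return 1.0 - hybrid_seconds / baseline_seconds


if __name__ == "__main__":
    # Made-up counts: flaky tests are rare, so the classes are imbalanced.
    print("MCC:", round(mcc(tp=45, tn=940, fp=5, fn=10), 3))
    # Made-up durations for a baseline and a hybrid detection run.
    print("Time saving:", round(time_saving(3600.0, 400.0), 3))
```

MCC is a useful summary here because flaky tests are usually a small minority of a test suite, so a metric that ignores true negatives or rewards always predicting "not flaky" would be misleading.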
Future Work
The findings in this paper open up several avenues for future research. For instance, we plan to further investigate the features associated with test flakiness, potentially using causal inference techniques to gain a deeper understanding of the root causes of flakiness. We also intend to evaluate the performance of CANNIER on more specific categories of flaky tests. Since this paper’s experiments were done on open-source Python projects, we are also interested in applying CANNIER to commercial software projects and to programs and test suites written in other programming languages, such as Java.
If you are interested in learning more about CANNIER and our research on flaky test detection, I encourage you to read the full paper (Parry et al. 2023).