Can mutating the code under test make a stable test go flaky? Yes, and it happens more than you think!

Introducing test flimsiness — a new concept in flaky test research!

Author

Gregory M. Kapfhammer

Published

2026

Introduction

What if changing the code under test, rather than the test itself or the execution environment, could turn a perfectly stable test case into a flaky one? That is exactly what my colleagues and I discovered in a new phenomenon that we call “test flimsiness”. The name is an acronym: FLIMsiness stands for FLakiness Induced by Mutations to the code under test. In (Parry et al. 2026), published at the International Conference on Software Engineering (ICSE), we present the first large-scale characterization of this overlooked phenomenon. After analyzing over 30 million mutant-test pairs across 28 Python projects, we found flimsiness lurking in 15 of those projects (54%). Keep reading to learn why this matters for anyone who uses mutation testing or relies on deterministic test outcomes!

Key Contributions

Defining and Characterizing Test Flimsiness: We introduce and formally characterize the concept of test flimsiness, where standard mutation operators applied to the code under test cause previously stable test cases to start failing intermittently. This is the first study to systematically investigate this phenomenon.
Large-Scale Empirical Study: We conducted an extensive empirical evaluation across 28 real-world Python projects, spanning over half a million test suite executions. Using rigorous statistical analysis with false discovery rate controls, we reliably identified transitions from stability to flakiness triggered by mutations.
A New Lens for Flaky Test Detection: Prior work perturbed the execution environment or injected flakiness into test code. We advance these efforts by perturbing a third major source of flakiness: the code under test. Our mutation-based rerunning strategy detects far more flaky tests than standard rerunning alone.
Comprehensive Public Dataset: We release a rich dataset comprising over half a million test suite runs and 30 million mutant-test pairs, annotated with test outcome reports and coverage data, in the paper’s replication package.
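The post does not name the false discovery rate procedure used in the paper. As an illustration only, the sketch below assumes the Benjamini-Hochberg step-up procedure, which is one common way to control the false discovery rate when many hypotheses (here, one per mutant-test pair) are tested at once. The function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True where the null hypothesis is
    rejected while controlling the false discovery rate at level alpha."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for index in order[:cutoff_rank]:
        rejected[index] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical p-values, one per mutant-test pair.
    print(benjamini_hochberg([0.041, 0.001, 0.216, 0.008]))
```

A design note on this assumed procedure: it rejects everything up to the largest rank that meets its threshold, so a p-value can be rejected even if it misses the threshold for its own rank.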

Empirical Results

Our study revealed several noteworthy findings about test flimsiness:

Flimsiness is prevalent: It occurs in 15 out of 28 projects (54%), with 0.7% of all mutants inducing flakiness in at least one test case. While 0.7% may seem small, we argue that the implications for mutant-driven techniques are significant.
Mutation operators differ in their tendency to induce flakiness: For example, mutants produced by the ReplaceUnaryOperator operator have more than three times the odds of inducing flakiness compared to mutants from other operators. Interestingly, operators that are better at producing killable mutants also tend to be better at inducing flakiness.
Mutation-based rerunning detects far more flaky tests: The mutation-based strategy detected a median of 740 flaky tests per project, compared to just 163 for standard rerunning. The flaky tests it found are especially unlikely to be detected by the standard strategy, which reinforces flimsiness as a distinct form of flakiness worthy of further study.
Unstable coverage does not explain flimsiness: Whether a test case non-deterministically covered the mutated line prior to mutation is a poor predictor of whether the mutant will induce flakiness. This suggests that the causes of flimsiness are more nuanced than unstable coverage alone.
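The difference between standard and mutation-based rerunning can be sketched with a small simulation. This is one simplified reading of the two strategies, not the paper's tooling: the test names, mutant names, and failure rates are all hypothetical, `run_test` fakes a test execution with a seeded random draw, and a test counts as flaky when its reruns show both passing and failing outcomes (the paper uses statistical analysis instead).

```python
import random


def failure_rate(test, variant):
    """Hypothetical failure probabilities standing in for real executions."""
    if test == "test_network":
        return 0.3  # flaky regardless of the code under test
    if (test, variant) == ("test_parse", "mutant_1"):
        return 0.5  # stable on the original, flaky under mutant_1
    if (test, variant) == ("test_total", "mutant_2"):
        return 1.0  # deterministically killed by mutant_2
    return 0.0


def run_test(test, variant, run):
    """Simulate one test execution; True means the test passed."""
    rng = random.Random(f"{test}|{variant}|{run}")
    return rng.random() >= failure_rate(test, variant)


def flaky_tests(tests, variant, reruns):
    """Tests with both passing and failing outcomes on one code variant."""
    flaky = set()
    for test in tests:
        outcomes = {run_test(test, variant, run) for run in range(reruns)}
        if outcomes == {True, False}:
            flaky.add(test)
    return flaky


def mutation_based_flaky_tests(tests, mutants, reruns):
    """Tests that show mixed outcomes against at least one mutant."""
    found = set()
    for mutant in mutants:
        found |= flaky_tests(tests, mutant, reruns)
    return found


TESTS = ["test_network", "test_parse", "test_total"]
MUTANTS = ["mutant_1", "mutant_2"]

standard = flaky_tests(TESTS, "original", reruns=100)
mutation_based = mutation_based_flaky_tests(TESTS, MUTANTS, reruns=100)
# Flaky only once the code under test is mutated.
flimsy = mutation_based - standard

if __name__ == "__main__":
    print("standard rerunning:      ", sorted(standard))
    print("mutation-based rerunning:", sorted(mutation_based))
    print("flimsy only:             ", sorted(flimsy))
```

In this toy setup, standard rerunning can only surface the test that is already flaky on the unmutated code, while mutation-based rerunning also surfaces the test whose flakiness appears under a mutant. The deterministically killed test shows up in neither set, because it fails on every run.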

Future Work

The paper opens several avenues for future investigation. My co-authors and I plan to study how flimsiness impacts specific mutant-driven techniques such as fault localization and regression testing. We also intend to survey developers to understand how they perceive flimsiness and whether they encounter it during routine use of mutation testing tools. Finally, we aim to investigate the root causes of flimsiness more systematically, potentially involving project developers in the analysis.

Further Details

Test flimsiness represents an exciting new direction in flaky test research, sitting at the intersection of mutation testing and test reliability. If you are interested in learning more, I encourage you to read (Parry et al. 2026). If you have questions or experiences related to mutation-induced flakiness, please contact me. To stay updated on the latest developments in flaky test research, you can subscribe to my mailing list.

References

Parry, Owain, Gregory M. Kapfhammer, Michael Hilton, and Phil McMinn. 2026. “Test Flimsiness: Characterizing Flakiness Induced by Mutation to the Code Under Test.” In Proceedings of the 48th International Conference on Software Engineering.