Flaky tests are just one symptom — your test suite needs a health check!
Introduction
When software developers talk about problematic test suites, the conversation often begins and ends with flaky tests. And while flakiness is certainly a serious concern, is it really the only symptom of an unhealthy test suite? My colleagues and I argue that it is not. In (McMinn, Roslan, and Kapfhammer 2025)
Key Contributions
Nine Health Indicators: We identify a checklist of indicators that signal an unhealthy test suite, starting with flakiness but extending to low code coverage, pseudo-testedness, low mutation scores, long-running test suites, low test diversity, high brittleness, low realism, and high variability of indicator metrics.
Trade-offs Between Indicators: We argue that some indicators are complementary while others are in tension. For instance, a test suite with fewer assertions may be less flaky but also more pseudo-tested, meaning it executes code without actually checking it. Pursuing a high mutation score might increase brittleness if tests become tightly coupled to implementation details rather than intended behavior.
A Research Agenda: We outline seven challenges for the research community, ranging from identifying further indicators and quantifying trade-offs to building tooling that can give developers actionable recommendations for improving the overall health of their test suites.
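To make the notion of pseudo-testedness from the trade-offs item above more concrete, here is a minimal Python sketch. It is not taken from the paper: the function `apply_discount`, its hand-written mutant, and the helper `passes` are hypothetical names chosen purely for illustration, and real mutation testing tools generate and run mutants automatically rather than by hand.

```python
def apply_discount(price, percent):
    """Hypothetical code under test: reduce price by the given percentage."""
    return price * (1 - percent / 100)


def apply_discount_mutant(price, percent):
    """Hand-written mutant (illustrative): the discount is never applied."""
    return price


def weak_test(func):
    """Executes the code but asserts nothing, so wrong output cannot fail it."""
    func(200, 25)


def strong_test(func):
    """Checks the intended behavior: 25% off 200 should be 150."""
    assert func(200, 25) == 150


def passes(test, func):
    """Return True if the test passes against the given implementation."""
    try:
        test(func)
        return True
    except AssertionError:
        return False


if __name__ == "__main__":
    for test in (weak_test, strong_test):
        survived = passes(test, apply_discount_mutant)
        print(f"{test.__name__}: passes original={passes(test, apply_discount)}, "
              f"mutant survives={survived}")
```

Both tests pass on the original code and both would contribute the same line coverage, but only `strong_test` fails on the mutant. A suite made up of tests like `weak_test` would look healthy on coverage while leaving `apply_discount` pseudo-tested, which is one reason the indicators need to be read together rather than in isolation.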
Key Insights
Since this is a short position paper and manifesto, the contribution is conceptual rather than experimental. The paper synthesizes insights from across the software testing literature to make the case that indicators like pseudo-testedness, brittleness, and realism deserve the same level of research attention that flakiness has received. The accompanying resource page catalogs existing detection and improvement tools for each indicator, showing where tooling already exists and where gaps remain.
Future Work
The paper outlines several open challenges, including how to measure less well-understood indicators like test realism and brittleness, how to combine multiple indicators into a composite picture of test suite health, and how to study the evolution of test suite health over the lifecycle of a project. Ultimately, the goal is to move toward automated tools that not only diagnose problems but also recommend concrete actions for developers to improve their test suites.
If you are interested in thinking about test quality beyond flakiness, I encourage you to read (McMinn, Roslan, and Kapfhammer 2025).