Empirically comparing hazard-guided LLM mutation techniques with existing LLM- and rule-based approaches
empirical study
mutation testing
software testing
Proceedings of the 30th International Conference on Evaluation and Assessment in Software Engineering
Abstract
Mutation testing tools normally rely on rule-based operators that mechanically swap and change source code tokens without understanding the code’s purpose. Recently, Large Language Models (LLMs) have enabled mutant generators to consider more of the code’s context and history. However, current LLM-based mutation testing methods have limited prompts that prevent them from unleashing the full power and creativity of the LLM to produce a diverse set of aggressive, yet realistic and productive, mutants. This paper’s novel approach uses an LLM to interpret a method and re-implement it entirely with mutations guided by hazard analysis, analogous to a programmer misunderstanding a project’s requirements or an aspect of an algorithm’s implementation. To enable the empirical comparison of the new hazard-guided techniques with both prior LLM-based and traditional mutation testing tools, this paper also presents and applies a framework that integrates representative LLM-based methods. Using this framework and 279 bugs in 15 projects from the Defects4J dataset, the results show that the hazard-guided techniques can harness both local and cloud-based LLMs to generate compilable, diverse, and powerful mutants that help test cases to detect unique defects not found by other methods.
Reference
@inproceedings{Maton2026,
author = {Megan Maton and Gregory M. Kapfhammer and Phil McMinn},
booktitle = {Proceedings of the 30th International Conference on Evaluation and Assessment in Software Engineering},
title = {Empirically comparing hazard-guided LLM mutation techniques with existing LLM- and rule-based approaches},
year = {2026}
}