Mutation testing

Quality assurance consists of many activities, but often our primary goal is to verify and validate the software our team is working on. Through various means we try to establish whether it fulfills customer requirements and other criteria like code quality, documentation, security, and so on. But who ensures that our own job was done correctly? Even if a colleague can review our test artifacts, they probably don't have enough time or context to do it thoroughly on top of their own commitments.

[Image: a collection of comic books, with an X-Men issue (a series about mutants) in focus.]

This is why I've always liked mutation testing: it evaluates the quality of the tests themselves. The idea is that we introduce small changes in the software under test and see whether our tests catch the unexpected behavior by failing. We can, for instance, modify an SQL query from this:

select * from shopping_carts where user_id = ?

To this:

select * from shopping_carts where user_id != ?

So even though an insert method elsewhere correctly adds items to a shopping cart, the cart will appear empty to the user because of the inverted where clause. If we don't have a test that validates the cart's contents after adding items, all our tests will pass, revealing a gap in coverage. Like any other form of testing, mutation testing isn't exhaustive and cannot prove the absence of bugs. Still, it can be a useful tool.
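To make the gap concrete, here's a minimal sketch of the same fault using an in-memory SQLite database and a hypothetical `shopping_carts` schema. A test suite that only checks the insert would pass against both versions of the query; only a test that reads the cart back kills the mutant:

```python
import sqlite3

# Hypothetical cart lookup. The `mutated` flag simulates the fault above:
# the WHERE clause flipped from "=" to "!=".
def get_cart_items(conn, user_id, mutated=False):
    op = "!=" if mutated else "="
    rows = conn.execute(
        f"SELECT item FROM shopping_carts WHERE user_id {op} ?", (user_id,)
    ).fetchall()
    return [r[0] for r in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shopping_carts (user_id INTEGER, item TEXT)")
conn.execute("INSERT INTO shopping_carts VALUES (1, 'book')")

# Reading the cart back is what kills the mutant:
assert get_cart_items(conn, 1) == ["book"]          # original query: item found
assert get_cart_items(conn, 1, mutated=True) == []  # mutant: cart looks empty
```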

There are numerous libraries that automate the creation of mutated software. The issue is that they generally work with unit tests or other tightly coupled tests. So when you need to build a Go project, launch the three separate processes it requires to function, then run tests written in TypeScript and evaluate the results, you quickly go beyond what these libraries can do. Which is why QA engineers working with higher-level end-to-end tests rarely perform mutation testing in practice. Well, that and time constraints 😀

Enter AI agents. Because they aren't deterministic systems designed to perform one task in one fixed way, they can easily bridge the gap between separate systems:

  • Read the code that needs to be tested
  • Introduce mutations
  • Rebuild the service
  • Run tests
  • Analyze the results
  • Repeat the process
  • Produce a report

Regardless of what these systems are or how they differ across the zoo of projects and technologies a large company might have.
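The loop above can be sketched as follows. Everything here is hypothetical scaffolding: `mutate` does the naive operator flipping a real agent would perform with far more context, and `rebuild_and_run_tests` stands in for whatever build-and-test pipeline your project actually has (compilers, process managers, test runners):

```python
import random

# Hypothetical operator-flip mutations; a real agent would derive these
# from reading the code under test.
MUTATIONS = [("==", "!="), ("<", "<="), ("+", "-")]

def mutate(source: str) -> str:
    """Introduce one small change: flip the first applicable operator."""
    candidates = [m for m in MUTATIONS if m[0] in source]
    if not candidates:
        return source
    old, new = random.choice(candidates)
    return source.replace(old, new, 1)

def run_campaign(source, rebuild_and_run_tests, rounds=10):
    """Repeat mutate -> rebuild -> test; collect survivors for the report."""
    survivors = []
    for _ in range(rounds):
        mutant = mutate(source)
        if mutant == source:
            continue  # nothing changed, skip this round
        if rebuild_and_run_tests(mutant):  # True == all tests still pass
            survivors.append(mutant)       # mutant survived: a coverage gap
    return survivors
```

A surviving mutant means the test suite never noticed the broken behavior, which is exactly the kind of finding the agent would put in its report.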

My recent experiments with this have shown a lot of promise. Even though the lack of determinism means that LLMs can and will produce misleading outputs from time to time, more often than not they provide valuable insights. For instance, the agent can look at a failed test and say: "Hey, this test failed after I broke the code, but this is because the status code was 500 instead of 204. If the status code was valid, this test would've passed because it missed [details] check." Not bad for an additional quality gate that wouldn't exist otherwise, eh?

In any case, whether you're working with unit or heavy end-to-end tests, I suggest giving mutation testing a try.