Test failure analysis with an LLM in the CI pipeline
On my current project, I'm the sole test engineer in a team with several developers. We merge pull requests only when regression tests are implemented, so I sometimes find myself under pressure to handle multiple tasks at once - especially when there are failed tests. Without my input, it can be difficult for the developers to understand what a test is doing and what exactly is causing a failure: is it a bug, or does the test need to be updated?
After reading many posts about AI in testing on softwaretestingweekly.com, I decided to give it a try and integrate Claude into our pipelines. The idea was to feed it test reports, grant it read-only access to the repository and the changes in a PR, and ask it to analyze everything and leave a comment with its findings.
One month later, here are my thoughts:
- Claude is quite good at understanding and explaining test scenarios; I honestly can't recall a time when it was wrong
- It's also good at understanding changes introduced in a PR
- Claude can combine the first two points and work out how modifications in the code lead to test failures
- However, it has very limited context: it doesn't attend your meetings, it doesn't have access to requirements, and it doesn't know your product as a whole - so Claude's assumptions can be incorrect; it mostly treats code changes as the source of truth
- Nevertheless, developers are happy because it's much easier and faster to read its findings than to analyze tests that are sometimes quite complex and written in a language they've never worked with
- This also means that my input isn't always required, so I have less context switching
- I even made small test updates directly in the GitHub UI a few times thanks to this analysis
- When there are many failed tests (>= 10), Claude seems to struggle to keep context and can forget to follow the post formatting rules defined in the prompt
- Currently, each analysis is posted as a separate comment instead of rewriting an existing reply, which can quickly flood your PR (a sketch of the update-in-place alternative follows this list)
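I can't share our pipeline code, but a common way to fix the flooding problem is to tag the bot comment with a hidden marker and then update that comment on later runs instead of posting a new one. Here is a minimal sketch against the GitHub REST API; the marker string, environment variables and function name are made up for the illustration:

```python
# Minimal sketch: keep one continuously updated analysis comment per PR.
# A hidden HTML marker identifies "our" comment so later runs can find it.
import os
import requests

MARKER = "<!-- llm-test-analysis -->"  # any unique string works; this one is made up
API = "https://api.github.com"


def upsert_analysis_comment(repo: str, pr_number: int, body: str, token: str) -> None:
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
    body = f"{MARKER}\n{body}"

    # PR conversation comments live under the *issues* endpoints.
    comments_url = f"{API}/repos/{repo}/issues/{pr_number}/comments"
    resp = requests.get(comments_url, headers=headers, params={"per_page": 100}, timeout=30)
    resp.raise_for_status()
    previous = next((c for c in resp.json() if MARKER in c.get("body", "")), None)

    if previous:
        # Overwrite the earlier analysis instead of flooding the PR.
        url = f"{API}/repos/{repo}/issues/comments/{previous['id']}"
        resp = requests.patch(url, headers=headers, json={"body": body}, timeout=30)
    else:
        resp = requests.post(comments_url, headers=headers, json={"body": body}, timeout=30)
    resp.raise_for_status()


if __name__ == "__main__":
    upsert_analysis_comment(
        repo=os.environ["GITHUB_REPOSITORY"],    # "owner/repo", set automatically in Actions
        pr_number=int(os.environ["PR_NUMBER"]),  # hypothetical variable passed in by the workflow
        body="### Test failure analysis\n\n...", # whatever the analysis produced
        token=os.environ["GITHUB_TOKEN"],
    )
```

The same pattern works with the gh CLI or any GitHub client library.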
All in all, test failure analysis with an LLM has had a rather positive impact. It didn't revolutionize our work or anything like that, but it made things a bit easier. Even when Claude's assumptions are wrong, the overall analysis provides a good starting point for us humans.
I can't go into implementation details, so here are just a few technical notes:
- We're using anthropics/claude-code-action
- To avoid overloading its context, we feed it the JSON files of failed tests produced by Allure; giving it the entire job log would be too much - job logs can reach megabytes of text (see the sketch after this list)
- GitHub supports collapsible blocks - use them to save space and improve readability; however, as I mentioned above, Claude tends to forget about this
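To make the last two notes concrete: I can't paste our actual script, but the pre-processing idea looks roughly like this, assuming the standard Allure 2 `*-result.json` layout; the limits, paths and helper names are invented for the example.

```python
# Rough sketch: distill Allure results into a compact, collapsible summary
# instead of shipping megabytes of job log. Field names assume the standard
# Allure 2 "*-result.json" layout; the limits below are arbitrary.
import json
from pathlib import Path

MAX_FAILURES = 10       # past roughly this point Claude started losing track for us
MAX_TRACE_CHARS = 2000  # keep stack traces short


def collect_failures(results_dir: str) -> list[dict]:
    failures = []
    for path in Path(results_dir).glob("*-result.json"):
        result = json.loads(path.read_text(encoding="utf-8"))
        if result.get("status") not in ("failed", "broken"):
            continue
        details = result.get("statusDetails", {})
        failures.append({
            "name": result.get("fullName") or result.get("name", "unknown test"),
            "message": details.get("message", ""),
            "trace": details.get("trace", "")[:MAX_TRACE_CHARS],
        })
    return failures[:MAX_FAILURES]


def to_comment_section(failures: list[dict]) -> str:
    # <details>/<summary> is the collapsible block GitHub renders in comments;
    # it keeps long traces from drowning the PR conversation.
    parts = [f"{len(failures)} failed test(s):"]
    for f in failures:
        parts.append(
            f"<details>\n<summary>{f['name']}</summary>\n\n"
            f"{f['message']}\n\n<pre>{f['trace']}</pre>\n</details>"
        )
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(to_comment_section(collect_failures("allure-results")))
```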
The examples in Anthropic's repository are very good. Just write your prompt and pass the information about failed tests to Claude.
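For illustration only (this isn't our exact wording), a prompt along these lines already gets useful results:

```
You are analyzing failed regression tests for this pull request.
For each failed test listed below:
1. Explain briefly what the test verifies.
2. Compare it with the changes introduced in this PR.
3. Say whether the failure looks like a product bug or an outdated test,
   and how confident you are.
Wrap each per-test analysis in a collapsible <details> block and post
everything as a single comment.
```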