Test failure analysis with an LLM in the CI pipeline
On my current project, I'm the sole test engineer in a team with several developers. We merge pull requests only when regression tests are implemented, so I sometimes find myself under pressure to handle multiple tasks at once - especially when there are failed tests. Without my input, it can be difficult for the developers to understand what a test is doing and what exactly is causing a failure: is it a bug, or does the test need to be updated?
After reading many posts about AI in testing on softwaretestingweekly.com, I decided to give it a try and integrate Claude into our pipelines. The idea was to feed it test reports, grant it read-only access to the repository and the changes in a PR, and ask it to analyze everything and leave a comment with its findings.
One month later, here are my thoughts:
- Claude is quite good at understanding and explaining test scenarios; I honestly can't recall a time when it was wrong
- It's also good at understanding changes introduced in a PR
- Claude can combine the first two points and work out how modifications in the code lead to test failures
- However, it has very limited context: it doesn't attend your meetings, it doesn't have access to requirements, and it doesn't know your product as a whole - so Claude's assumptions can be incorrect; it mostly treats code changes as the source of truth
- Nevertheless, developers are happy because it's much easier and faster to read its findings than to analyze tests that are sometimes quite complex and written in a language they've never worked with
- This also means that my input isn't always required, so I have less context switching
- I even made small test updates directly in the GitHub UI a few times thanks to this analysis
- When there are many failed tests (>= 10), Claude seems to struggle to keep context and can forget to follow the post formatting rules defined in the prompt
- Currently, each analysis is posted as a separate comment instead of rewriting an existing reply, which can quickly flood your PR (a sketch of the update-in-place alternative follows this list)
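I can't share our pipeline code, but a common way to fix the flooding problem is to tag the bot comment with a hidden marker and then update that comment on later runs instead of posting a new one. Here is a minimal sketch against the GitHub REST API; the marker string, environment variables and function name are made up for the illustration:

```python
# Minimal sketch: keep one continuously updated analysis comment per PR.
# A hidden HTML marker identifies "our" comment so later runs can find it.
import os
import requests

MARKER = "<!-- llm-test-analysis -->"  # any unique string works; this one is made up
API = "https://api.github.com"


def upsert_analysis_comment(repo: str, pr_number: int, body: str, token: str) -> None:
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
    body = f"{MARKER}\n{body}"

    # PR conversation comments live under the *issues* endpoints.
    comments_url = f"{API}/repos/{repo}/issues/{pr_number}/comments"
    resp = requests.get(comments_url, headers=headers, params={"per_page": 100}, timeout=30)
    resp.raise_for_status()
    previous = next((c for c in resp.json() if MARKER in c.get("body", "")), None)

    if previous:
        # Overwrite the earlier analysis instead of flooding the PR.
        url = f"{API}/repos/{repo}/issues/comments/{previous['id']}"
        resp = requests.patch(url, headers=headers, json={"body": body}, timeout=30)
    else:
        resp = requests.post(comments_url, headers=headers, json={"body": body}, timeout=30)
    resp.raise_for_status()


if __name__ == "__main__":
    upsert_analysis_comment(
        repo=os.environ["GITHUB_REPOSITORY"],    # "owner/repo", set automatically in Actions
        pr_number=int(os.environ["PR_NUMBER"]),  # hypothetical variable passed in by the workflow
        body="### Test failure analysis\n\n...", # whatever the analysis produced
        token=os.environ["GITHUB_TOKEN"],
    )
```

The same pattern works with the gh CLI or any GitHub client library.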
All in all, test failure analysis with an LLM has had a rather positive impact. It didn't revolutionize our work or anything like that, but it made things a bit easier. Even when Claude's assumptions are wrong, the overall analysis provides a good starting point for us humans.
I can't go into implementation details, so here are just a few technical notes:
- We're using anthropics/claude-code-action
- To avoid overloading its context, we feed it the JSON files of failed tests produced by Allure; giving it the entire job log would be too much - job logs can reach megabytes of text (see the sketch after this list)
- GitHub supports collapsible blocks - use them to save space and improve readability; however, as I mentioned above, Claude tends to forget about this
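To make the last two notes concrete: I can't paste our actual script, but the pre-processing idea looks roughly like this, assuming the standard Allure 2 `*-result.json` layout; the limits, paths and helper names are invented for the example.

```python
# Rough sketch: distill Allure results into a compact, collapsible summary
# instead of shipping megabytes of job log. Field names assume the standard
# Allure 2 "*-result.json" layout; the limits below are arbitrary.
import json
from pathlib import Path

MAX_FAILURES = 10       # past roughly this point Claude started losing track for us
MAX_TRACE_CHARS = 2000  # keep stack traces short


def collect_failures(results_dir: str) -> list[dict]:
    failures = []
    for path in Path(results_dir).glob("*-result.json"):
        result = json.loads(path.read_text(encoding="utf-8"))
        if result.get("status") not in ("failed", "broken"):
            continue
        details = result.get("statusDetails", {})
        failures.append({
            "name": result.get("fullName") or result.get("name", "unknown test"),
            "message": details.get("message", ""),
            "trace": details.get("trace", "")[:MAX_TRACE_CHARS],
        })
    return failures[:MAX_FAILURES]


def to_comment_section(failures: list[dict]) -> str:
    # <details>/<summary> is the collapsible block GitHub renders in comments;
    # it keeps long traces from drowning the PR conversation.
    parts = [f"{len(failures)} failed test(s):"]
    for f in failures:
        parts.append(
            f"<details>\n<summary>{f['name']}</summary>\n\n"
            f"{f['message']}\n\n<pre>{f['trace']}</pre>\n</details>"
        )
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(to_comment_section(collect_failures("allure-results")))
```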
The examples in Anthropic's repository are very good. Just write your prompt and pass the information about failed tests to Claude.
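For illustration only (this isn't our exact wording), a prompt along these lines already gets useful results:

```
You are analyzing failed regression tests for this pull request.
For each failed test listed below:
1. Explain briefly what the test verifies.
2. Compare it with the changes introduced in this PR.
3. Say whether the failure looks like a product bug or an outdated test,
   and how confident you are.
Wrap each per-test analysis in a collapsible <details> block and post
everything as a single comment.
```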