How do you test AI agents in production? The unpredictability is overwhelming.[D]
I’ve been in QA for almost a decade. My mental model for quality was always: given input X, assert output Y. Now I’m on a team that’s shipping an LLM-based agent that handles multi-step tasks. I genuinely do not know how to test this in a way that feels rigorous.
The thing works. But the output isn’t deterministic. The same input can produce different reasoning chains across runs. Hell even with temp=0 I see variation in tool selection and intermediate steps. My normal instincts don’t map here. I can’t write an assertion and run it a thousand times to track flakiness. I’m at a loss for what to do.
Snapshot testing on final outputs is too brittle. If there’s a correct response that’s worded differently it breaks the test. Regex/keyword matching on outputs misses reasoning errors that accidentally land on the correct answer. Human eval isn’t automatable and doesn’t scale. Evals with a scoring rubric almost works but I don’t have a way to set pass/fail thresholds.
I want something conceptually equivalent to integration tests for reasoning steps. Like, given this tool result does the next step correctly incorporate it? I don’t know how to make that assertion without either hardcoding expected outputs or using another LLM as a judge, which would introduce a new failure mode into my test suite.
The agent runs inside our product. There are real uses and actual consequences when it makes a bad call.
Is there a framework that allows for verifying of agentic reasoning?
[link] [comments]
Want to read more?
Check out the full article on the original site