Testing AI Agents Like You Mean It — Alessandro Usseglio Viretta

Testing AI agents is harder than it looks, and the methods you'd reach for from traditional software don't transfer cleanly. Teams often end up with thin test suites not out of laziness but because the obvious tools — unit tests, fixtures, exact-match assertions — don't fit the shape of the problem.

Traditional software has a finite input space. You can enumerate the cases that matter, write tests for them, and be done. An AI agent takes natural language input from real humans, which means the input space is effectively infinite. Worse, failures are often emergent: the agent handles question A fine, handles question B fine, and then produces something wrong when A and B appear together in the same conversation. You can't discover that by testing A and B separately.

There's also the subtlety problem. When traditional software fails, it usually fails loudly: an exception, a crash, a wrong number. When an agent fails, it usually produces something that looks right. Plausible, confident, and wrong. Those are harder to catch.

In a preprint on meta-prompting for behavior specification, I describe how to document the knowledge accumulated during prompt refinement. Testing is the natural extension: once you have a formal specification of what your agent should do, you can verify whether it does. The eight methods below are how.

1. Persona-based testing is the most intuitive starting point. You construct artificial users with specific behavioral profiles and run conversations with them. The mistake most practitioners make is keeping personas vague: "a difficult user," "a confused user." What reveals failures is combining specific traits across independent axes. A user who is an expert in the domain, gives terse one-word answers, is impatient, and wants something slightly outside what the agent is designed for: that combination stresses the system in ways no single-axis persona would. Published research from NeurIPS 2025 confirms that persona diversity improves both the effectiveness and the coverage of adversarial testing [1].

2. Behavioral contract testing translates your specification into explicit assertions. Each assertion takes the form: given this conversation state, when the user says this, the agent must do X and must not do Y. If your agent is a slot-filling system that collects information from users, a contract test might verify that the agent never asks for a piece of information the user already provided three turns earlier. This is the analog of unit testing in software, except the units are behavioral properties rather than functions.

3. Mutation testing asks a different question: how good is your test suite? The method comes from software engineering. You take your behavior specification, introduce small deliberate degradations (remove a constraint, weaken a condition, invert a rule), and check whether your tests catch the change. If a degraded specification passes all your tests, your test suite has a gap. This is how you find out which parts of your specification are doing real work versus which parts are decorative.

4. Adversarial prompt generation goes beyond personas to systematically target specific failure classes: jailbreak attempts, inputs sitting at the edge of a behavioral constraint, irrelevant information injected mid-conversation to see whether it contaminates the agent's behavior. For agents that process email or other untrusted input, prompt injection deserves particular attention. The OWASP Top 10 for Agentic Applications identifies goal hijacking as the leading attack vector for agentic applications [2]. Tools like Garak (NVIDIA) and PyRIT (Microsoft) can generate adversarial inputs at scale [3, 4].

5. Transcript-based regression testing comes from a diagnostic habit I keep coming back to: when an agent misbehaves, reading the conversation transcript tells you far more than re-reading the specification does. The transcript shows you exactly which turn went wrong, which piece of information got skipped, which shortcut the model took. The specification rarely explains why a failure happened.

I discuss this pattern at length in the meta-prompting preprint, where it emerged across refinement cycles: behavioral failures became legible only when examined turn by turn in the actual conversation record, not by re-examining the prompt. That insight grounds a practical discipline. Every time a failure appears in production or in a testing session, save the full transcript, identify the exact turn where things went wrong, and encode that as a test case. The minimum input sequence that reproduces the failure becomes a permanent regression test. Run it after every specification or model change. Over time the regression suite becomes the most useful artifact in your testing infrastructure: a map of every failure mode you've encountered, grounded in real behavior.

6. Property-based testing shifts from specific cases to general rules. Rather than writing individual test cases, you define properties that should hold across all possible inputs, then use automated generation to find inputs that violate them. A slot-filling agent should never ask for information it already has (idempotency). Adding more context should never make the agent less capable (monotonicity). Information from one conversation session should not contaminate a subsequent one (non-interference). A single counterexample is a confirmed failure.

7. Multi-model disagreement testing runs the same inputs through multiple LLMs with the same specification and flags cases where their outputs diverge significantly. Disagreement is a signal: if two models behave differently given the same spec and the same input, the spec doesn't constrain behavior enough to produce consistent results. Regions of high disagreement are specification gaps worth addressing. This method also gives you advance warning about model migration risk: behavioral properties that are consistent across models are portable; properties that vary are model-dependent and will require specification work if you ever switch.

8. Human red-teaming is the last layer, and it's necessary for a reason that's easy to underestimate. AI personas miss failure modes that humans find, because humans make unexpected semantic leaps, have emotional reactions that change their behavior mid-conversation, and try combinations no automated system would think to generate. The key is making human sessions structured rather than open-ended: assign each tester a specific failure class to probe, require them to document the exact input sequence that caused a failure, and run them blind to the specification itself. Convert every confirmed failure into a regression test before the session ends.

These eight methods aren't independent. They form a stack, ordered by speed and cost. Contract tests run on every specification change; persona and adversarial tests on every deployment. Mutation and regression suites run weekly, or after model updates. Human sessions are reserved for major releases. At runtime, production monitoring extends the stack continuously, feeding new failures back into the regression suite.

For the attentive reader, there's a second way to cut the same stack — by what the test actually decides. Evaluating LLM output is hard because "correct" is rarely a single string, so it helps to separate the methods above into three tiers of evaluation difficulty. Tier A is deterministic: property tests and structural contract tests that check exact, mechanical conditions — did the agent ask for a field it already had, did the output validate against a schema, did a forbidden token appear. No LLM is needed to grade these; a regular expression or a parser will do. Tier B is difference-based: regression tests and multi-model disagreement, where the verdict comes from comparing one output against another (a saved baseline, or a peer model on the same input). The LLM is used as a diff engine, not a judge. Tier C is judgment-based: "was this response appropriate, helpful, on-policy?" This is the hardest and least reliable tier, and the honest default is to defer it to humans or outsource it carefully. When it has to be automated, an LLM-as-judge is feasible — but only when the judging criteria are themselves the product of meta-prompting, written down as a specification the judge model can apply consistently. Otherwise you've replaced one ungrounded verdict with another.

The practical consequence: push as much of your test suite as you can into Tier A, accept Tier B where comparison is meaningful, and treat Tier C as a scarce resource. Many developers get this backwards and lean on Tier C because it feels like it covers everything; in practice it covers nothing reliably.

The regression suite is the connective tissue. Every failure found at any level, in any session, in any production incident, should be encoded there before the next cycle begins. That's how the testing stack learns from its history instead of rediscovering the same failures repeatedly.

What's largely absent from the current research literature is testing for behavioral conformance to a specification. Safety testing and capability benchmarks have extensive coverage; conformance does not. Contract testing and mutation testing only become meaningful once a formal specification exists to verify against, and for most agents, no such specification exists. Building specifications is the prerequisite. Testing them is the payoff.

Bibliography

[1] Deng, W. H., Kim, S. S. Y., Jha, A., Holstein, K., Eslami, M., Wilcox, L., & Gatys, L. A. (2025). PersonaTeaming: Exploring How Introducing Personas Can Improve Automated AI Red-Teaming. Workshop on Regulatable ML (ReML) at NeurIPS 2025. arXiv:2509.03728. https://arxiv.org/abs/2509.03728

[2] OWASP GenAI Security Project. (2025, December 9). OWASP Top 10 for Agentic Applications 2026. https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/

[3] Derczynski, L., et al. (2024). Garak: A Framework for Security Probing Large Language Models. NVIDIA. https://github.com/NVIDIA/garak

[4] Lopez Munoz, R., et al. (2024). PyRIT: A Framework for Security Risk Identification in Generative AI. Microsoft. https://github.com/microsoft/PyRIT

[5] Usseglio Viretta, A. (2026). Distilling LLM Behavior: How Meta-Prompting Extracts Transferable Design Principles (v2). Zenodo. https://doi.org/10.5281/zenodo.19865603

Suggested Reading

For the methods in this article and the specification work underlying them:

Usseglio Viretta, A. (2026). Distilling LLM Behavior: How Meta-Prompting Extracts Transferable Design Principles. Zenodo. https://doi.org/10.5281/zenodo.19865603 — The preprint on meta-prompting for behavior specification that this article builds on. If you're working on complex LLM behavior specifications, the methodology described there is the prerequisite for behavioral contract testing and mutation testing.

Deng et al. (2025). PersonaTeaming. arXiv:2509.03728. https://arxiv.org/abs/2509.03728 — The NeurIPS 2025 paper on persona-based red-teaming. Shows empirically that persona diversity improves adversarial coverage, with attack success rate improvements up to 144% over state-of-the-art baselines.

OWASP GenAI Security Project. (2025). OWASP Top 10 for Agentic Applications 2026. https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/ — The practical security framework for agentic applications, covering goal hijacking, tool misuse, and identity abuse. Required reading for anyone deploying agents that interact with external systems.

NVIDIA. Garak: LLM Vulnerability Scanner. https://github.com/NVIDIA/garak — Open-source tool for broad-spectrum adversarial probing of LLMs. Good starting point for automated adversarial testing; covers encoding bypasses, prompt injection, jailbreaks, and more.

Microsoft AI Red Team. PyRIT: Python Risk Identification Toolkit. https://github.com/microsoft/PyRIT — Microsoft's open-source framework for multi-turn adversarial attack strategies. More flexible than Garak for designing custom attack scenarios; supports multi-modal inputs and quantitative scoring.