How to Test AI Voice Agents Before Shipping: Replay Testing + QA Scoring Guide
Most teams ship prompt changes to production and hope real callers don't notice the regression. There's a better way — the same CI/CD discipline engineering teams apply to code, applied to voice agent behavior. Here's the playbook.
The Production Regression Nobody Talks About
A week after you deploy your voice agent, the conversion rate drops six points. Nobody pushed a broken feature. Nobody ran an incident response. The symptom is invisible: the agent sounds fine, the calls complete, the dashboard is green. But over the course of seven days, something the agent used to do — handle a specific objection, close the booking after a single mention of pricing, recognize the emergency phrases — has quietly stopped working.
This failure mode is unique to LLM-driven voice agents. A code regression throws an exception you can trace. A voice agent regression just... happens. Somebody tweaked the persona prompt to sound friendlier and accidentally softened the qualifying questions. Somebody added a knowledge-base article that now gets retrieved too often and crowds out the objection scripts. Somebody upgraded the underlying model and the new one handles interruptions slightly differently.
The bug exists. It's shipped to production. Real callers are experiencing it. And unless you have a disciplined way to detect the regression, you find out about it from the monthly conversion report — two weeks after it started.
Two Different Problems: Scoring vs Testing
There are two distinct testing disciplines for voice agents, and they solve different problems:
**QA scoring** is what happens after a call. Every completed call gets graded on goal completion, script adherence, persona consistency, objection handling, and technical quality. You get an overall 0–100 score, a dimension breakdown, and a list of flagged issues. It's retrospective. It tells you *what happened* on real calls, so you can catch patterns, surface the worst calls for review, and aggregate trends over time.
**Replay testing** is what happens before you ship a change. You define a suite of canned scenarios — an opening prompt plus a description of what "pass" means — and the system replays them against your agent's current configuration. It's prospective. It tells you *what would happen* on representative calls, so you can catch regressions before real callers experience them.
Most teams skip both. Some teams do one. Very few do both. Teams that do both — QA scoring for production monitoring, replay testing for pre-deploy verification — get the full CI/CD analog for voice agents: issues get caught at deploy time (replay) or within the first batch of real calls (scoring), not weeks later from slow-moving reports.
How QA Scoring Works in Practice
The simplest useful QA scoring rubric has five dimensions, each graded 0–20:
1. **Goal completion** — did the agent accomplish the template's stated objective? A lead-qualifier that ends the call without capturing budget, timeline, and authority gets a low score. A booker that transferred instead of booking gets a low score.
2. **Script adherence** — did the agent stay on the configured questions and handoff logic? Going off-script to improvise can be fine in moderation but becomes a risk signal when it happens every call.
3. **Persona consistency** — did the tone, style, and character stay in role the whole call? The friendly-but-professional agent who becomes clinical after a caller pushes back has a persona bug.
4. **Objection handling** — did the agent recover when the caller pushed back? Pricing objections, timeline objections, competing-vendor mentions. The response should be calm, on-brand, and either resolve the objection or gracefully route to the right person.
5. **Technical quality** — no dead air, no inappropriate interruptions, clean turn-taking. This catches latency issues and model failures that don't show up in the content but kill conversion.
A judge model scores each dimension from the transcript, returns a JSON breakdown, and the platform aggregates scores into per-agent dashboards. Surfacing the bottom 10% of calls is more actionable than averaging 500 calls into a single number — the outliers are where the improvement opportunities live.
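To make that concrete, here is a minimal sketch of the judge step in Python, assuming an OpenAI-style chat-completions client; the model name, rubric wording, and helper functions are illustrative, not any particular platform's internal implementation.

```python
import json

from openai import OpenAI  # assumed judge backend; any chat-completion client works

client = OpenAI()

RUBRIC = """You are a QA judge for voice-agent calls. Score the transcript on five
dimensions, each 0-20: goal_completion, script_adherence, persona_consistency,
objection_handling, technical_quality. Return JSON:
{"scores": {"<dimension>": <int>}, "flags": ["<issue>"], "summary": "<one sentence>"}"""

def score_call(transcript: str, agent_goal: str) -> dict:
    """Grade one completed call against the five-dimension rubric."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model, not a recommendation
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user",
             "content": f"Agent goal: {agent_goal}\n\nTranscript:\n{transcript}"},
        ],
    )
    result = json.loads(resp.choices[0].message.content)
    result["total"] = sum(result["scores"].values())  # overall score out of 100
    return result

def bottom_decile(scored_calls: list[dict]) -> list[dict]:
    """Surface the worst ~10% of calls for human review instead of averaging."""
    ranked = sorted(scored_calls, key=lambda c: c["total"])
    return ranked[: max(1, len(ranked) // 10)]
```

The `bottom_decile` helper is where the review workflow starts: feed in a batch of scored calls, then read the transcripts it returns.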
How Replay Testing Works in Practice
Replay testing is the pre-deploy verification layer. The workflow:
1. **Define a suite**. Five to twenty scenarios that cover the happy path and the most common edge cases. Each scenario has a caller prompt ("Hi, I saw your ad for furnace tune-ups, how much does that cost?") and a pass criterion ("agent quotes the tune-up price, asks for the caller's address, and books an appointment").
2. **Run before shipping**. After any change to the persona, script, knowledge base, or integrations, replay the suite against the new configuration. Each case simulates a short 3–8 turn conversation where a language model plays both the caller and the agent (using the agent's actual system prompt), and a judge model evaluates the final transcript against the pass criterion.
3. **Diff vs baseline**. The most recent completed run is the baseline. New runs show which cases regressed (was passing, now failing), which cases newly started passing (was failing, now fixed), and whether net pass rate improved or declined.
4. **Ship with confidence or fix with context**. A green suite is a strong signal the change is safe. A red suite gives you the exact transcript and judge reasoning for every failing case — you see what the agent did instead of what you wanted it to do.
A suite takes a minute to define and runs in about 60–90 seconds for ten cases. Compared to the cost of discovering a regression from conversion-rate analysis weeks later, that's essentially free insurance.
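If you want to picture the mechanics, here is a minimal sketch of that loop in Python, assuming an OpenAI-style chat-completions client; the prompts, model name, and field names are illustrative rather than any platform's actual implementation.

```python
import json

from openai import OpenAI  # assumed backend; any chat-completion client works

client = OpenAI()
MODEL = "gpt-4o-mini"  # illustrative simulation/judge model

def llm(system: str, messages: list[dict], json_mode: bool = False) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system}, *messages],
        response_format={"type": "json_object"} if json_mode else {"type": "text"},
    )
    return resp.choices[0].message.content

def simulate(agent_prompt: str, caller_opening: str, turns: int = 6) -> str:
    """One model plays the caller, another plays the agent using its real system prompt."""
    caller_persona = ("You are a realistic phone caller. Stay in character, "
                      "reply briefly, and react naturally to what the agent says.")
    history = [("caller", caller_opening)]
    for _ in range(turns):
        agent_view = [{"role": "user" if who == "caller" else "assistant", "content": text}
                      for who, text in history]
        history.append(("agent", llm(agent_prompt, agent_view)))
        caller_view = [{"role": "assistant" if who == "caller" else "user", "content": text}
                       for who, text in history]
        history.append(("caller", llm(caller_persona, caller_view)))
    return "\n".join(f"{who.upper()}: {text}" for who, text in history)

def judge(transcript: str, criterion: str) -> dict:
    raw = llm(
        'You judge replay tests for a voice agent. '
        'Return JSON: {"pass": true/false, "reason": "<one sentence>"}.',
        [{"role": "user", "content": f"Pass criterion: {criterion}\n\nTranscript:\n{transcript}"}],
        json_mode=True,
    )
    return json.loads(raw)

def run_suite(cases: list[dict], agent_prompt: str, baseline: dict[str, bool]) -> dict:
    """Replay every case, then diff pass/fail against the most recent completed run."""
    results, reasons = {}, {}
    for case in cases:
        transcript = simulate(agent_prompt, case["caller_prompt"])
        verdict = judge(transcript, case["pass_criterion"])
        results[case["name"]] = verdict["pass"]
        reasons[case["name"]] = verdict["reason"]
    return {
        "pass_rate": sum(results.values()) / len(results),
        "regressed": [n for n, ok in results.items() if baseline.get(n) and not ok],
        "fixed": [n for n, ok in results.items() if baseline.get(n) is False and ok],
        "reasons": reasons,
    }
```

Wired into CI, exiting nonzero whenever `regressed` is non-empty gives you the same deploy gate a failing unit test would.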
What to Put in Your First Replay Suite
The temptation is to cover every possible scenario. Don't. A first suite with 5–8 cases is more valuable than a 50-case suite that's too expensive to run on every change. Here's a practical starting point for a lead qualifier:
- **Happy path — qualified buyer**: caller matches ICP, answers questions, agrees to book. Pass: agent captures budget, timeline, and contact info, books a slot.
- **Happy path — unqualified lead**: caller doesn't meet criteria. Pass: agent politely ends the call without pushing further, doesn't promise a callback.
- **Price objection**: caller asks cost before agent has context. Pass: agent acknowledges, asks a qualifying question, then quotes if appropriate.
- **Timing objection**: caller says "not right now, just researching." Pass: agent captures callback preference instead of forcing a close.
- **Competing vendor**: caller mentions they're also talking to [competitor]. Pass: agent acknowledges without bashing, focuses on differentiators, asks for next step.
- **Off-topic**: caller asks something totally unrelated (weather, politics, a pricing question about a different business). Pass: agent politely redirects to the qualification flow.
- **Edge case — emergency**: if your template handles emergencies, include one. Pass: agent routes to emergency flow, not normal booking.
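For concreteness, here is what that starter suite can look like as data, in the shape the `run_suite` sketch above expects; the case names, opening lines, and pass criteria are illustrative.

```python
# The starter lead-qualifier suite as plain data (names and wording illustrative)
LEAD_QUALIFIER_SUITE = [
    {"name": "happy_path_qualified",
     "caller_prompt": "Hi, I need a new furnace installed next month and I have a budget in mind.",
     "pass_criterion": "Agent captures budget, timeline, and contact info, then books a slot."},
    {"name": "happy_path_unqualified",
     "caller_prompt": "I'm outside your service area but curious about pricing.",
     "pass_criterion": "Agent politely ends the call without pushing or promising a callback."},
    {"name": "price_objection",
     "caller_prompt": "Before anything else, how much does a tune-up cost?",
     "pass_criterion": "Agent acknowledges, asks a qualifying question, then quotes if appropriate."},
    {"name": "timing_objection",
     "caller_prompt": "I'm just researching right now, not ready to book anything.",
     "pass_criterion": "Agent captures a callback preference instead of forcing a close."},
    {"name": "competing_vendor",
     "caller_prompt": "I'm also getting a quote from another company this week.",
     "pass_criterion": "Agent acknowledges without bashing, covers differentiators, asks for a next step."},
    {"name": "off_topic",
     "caller_prompt": "Do you know what the weather's supposed to be like this weekend?",
     "pass_criterion": "Agent politely redirects to the qualification flow."},
    {"name": "emergency_routing",
     "caller_prompt": "I smell gas in my house right now.",
     "pass_criterion": "Agent routes to the emergency flow, not normal booking."},
]
```

Checking the suite into the same repo as the agent's persona and script keeps it versioned alongside the configuration it protects.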
Run this suite before every persona/script/KB change. Add a new case every time a real caller produces a surprising failure you want to prevent next time. Over a quarter, your suite evolves into a living specification of what the agent must be able to do — the most valuable internal documentation you'll produce for the team.