Latency

The delay between a caller finishing a sentence and the AI agent beginning its response. Sub-500ms is considered natural.

In voice AI, latency is often treated as the most critical metric for conversation quality. It is measured from the moment a caller stops speaking to the moment the AI agent begins its response. Natural human conversation has response gaps of roughly 200-400ms, which is why a response that begins within 500ms still feels natural.

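In a pipeline that chains separate models, total latency is roughly the sum of its stages: speech-to-text (ASR) processing, language model inference, and text-to-speech (TTS) generation. Each stage adds delay, so the budget available to any single stage is well below the overall target.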
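The additive model above can be sketched as a simple latency budget. This is a minimal illustration, not a measurement tool: the class name `LatencyBudget`, its field names, and the per-stage numbers are all assumptions chosen for the example. Only the 500ms threshold comes from the definition in this entry.

```python
from dataclasses import dataclass

# Threshold taken from this entry: sub-500ms is considered natural.
NATURAL_THRESHOLD_MS = 500


@dataclass
class LatencyBudget:
    """Hypothetical per-stage delays, in milliseconds."""
    stt_ms: float  # speech-to-text processing
    llm_ms: float  # language model inference
    tts_ms: float  # text-to-speech generation

    def total_ms(self) -> float:
        # Total latency modeled as the plain sum of the three stages.
        return self.stt_ms + self.llm_ms + self.tts_ms

    def feels_natural(self) -> bool:
        return self.total_ms() < NATURAL_THRESHOLD_MS


# Illustrative numbers only; real values depend on the models and network.
budget = LatencyBudget(stt_ms=150, llm_ms=200, tts_ms=100)
print(budget.total_ms(), budget.feels_natural())
```

Treating the total as a plain sum makes it easy to see which stage to optimize first: whichever one consumes the largest share of the 500ms target.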

Modern systems reach sub-500ms latency through techniques such as streaming audio processing, edge-deployed models, response pre-generation, and end-to-end speech models. At this speed, the pause before a reply is close to that of natural human conversation, and most callers do not perceive it as a delay.
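To see why streaming matters, here is a rough sketch comparing a batch pipeline, where each stage waits for the previous one to finish, with a streaming pipeline, where each stage starts as soon as the previous one emits partial output. It is a deliberately simplified model: the `Stage` type, both functions, and all timings are hypothetical, and real systems have overlap and network effects that this ignores.

```python
from typing import NamedTuple, Sequence


class Stage(NamedTuple):
    name: str
    first_output_ms: float  # time until the stage emits its first partial result
    complete_ms: float      # time until the stage has finished entirely


def batch_latency(stages: Sequence[Stage]) -> float:
    """Each stage waits for the previous stage to finish completely."""
    return sum(s.complete_ms for s in stages)


def streaming_latency(stages: Sequence[Stage]) -> float:
    """Each stage starts once the previous stage emits its first output."""
    return sum(s.first_output_ms for s in stages)


# Made-up timings for illustration only.
pipeline = [
    Stage("stt", first_output_ms=100, complete_ms=300),
    Stage("llm", first_output_ms=150, complete_ms=900),
    Stage("tts", first_output_ms=120, complete_ms=600),
]

print(batch_latency(pipeline))      # 1800
print(streaming_latency(pipeline))  # 370
```

Under these assumed timings, only the streaming variant lands under the 500ms target, because the caller hears the first audio long before any stage has finished its full output.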

Related terms

ASR, TTS