ASR (Automatic Speech Recognition)
Technology that converts spoken language into text, enabling AI systems to understand what a caller says.
Automatic Speech Recognition is the foundational technology that allows AI voice agents to understand human speech in real time. Modern ASR systems use deep neural networks trained on thousands of hours of audio data to achieve word error rates below 5%.
In the context of AI voice agents, ASR runs continuously during a call, converting the caller's speech into text that the language model can process. The speed and accuracy of ASR directly affects the conversational quality: faster transcription means lower latency, and higher accuracy means fewer misunderstandings.
Recent advances in end-to-end speech models have blurred the line between ASR and LLM processing. Native speech-to-speech models handle speech input directly, eliminating the separate ASR step and reducing latency further.