STT (Speech-to-Text)

The process of converting spoken audio into written text. Also called ASR (Automatic Speech Recognition).

Speech-to-Text is the first step in the AI voice agent pipeline: turning the caller's spoken words into text that the language model can process.

Many modern STT systems use transformer-based neural networks and can process audio in real time. They are increasingly robust to accents, background noise, and varied speaking speeds.

For voice agents, STT needs to be both fast and accurate. Streaming STT, which begins processing audio before the caller finishes speaking, helps minimize latency.
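The streaming idea can be illustrated with a minimal sketch. It is not a real recognizer: `fake_recognize` is a hypothetical stand-in for an acoustic model, and each "audio" chunk is simply UTF-8 text. The names `stream_transcribe` and `fake_recognize` are assumptions made for this example only. What it shows is the shape of the interface: a partial transcript is available after every chunk, so downstream components can start working before the caller stops talking.

```python
from typing import Iterable, Iterator, List, Tuple


def fake_recognize(chunk: bytes) -> str:
    """Hypothetical stand-in for a speech model.

    Assumption: each 'audio' chunk is just UTF-8 text, so decoding it
    plays the role of recognition in this toy example.
    """
    return chunk.decode("utf-8")


def stream_transcribe(chunks: Iterable[bytes]) -> Iterator[Tuple[str, bool]]:
    """Yield (transcript_so_far, is_final) as chunks arrive.

    A partial result is emitted after every chunk; one final result is
    emitted once the input is exhausted.
    """
    words: List[str] = []
    for chunk in chunks:
        text = fake_recognize(chunk)
        if text:
            words.append(text)
        yield " ".join(words), False  # partial: caller may still be speaking
    yield " ".join(words), True  # final: end of audio reached


if __name__ == "__main__":
    caller_audio = [b"I'd like", b"to book", b"an appointment"]
    for transcript, is_final in stream_transcribe(caller_audio):
        label = "final" if is_final else "partial"
        print(f"[{label}] {transcript}")
```

A batch system would, by contrast, wait for all three chunks before returning anything. In a voice agent, the partial results are what let the rest of the pipeline begin work early.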

Related terms

ASR, TTS