TTS (Text-to-Speech)

Technology that converts written text into natural-sounding spoken audio.

Text-to-Speech is the final step in the voice agent pipeline: converting the AI model's text response into spoken audio.

The output of modern neural TTS models can be nearly indistinguishable from human speech. The best models generally pronounce proper nouns, numbers, and technical terminology correctly, and convey appropriate emotional tone.

For voice agents, TTS speed matters as much as quality. Streaming TTS begins generating audio as soon as the first words of the response are available, rather than waiting for the complete text, which reduces perceived latency.
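
As a rough illustration of the streaming idea, the Python sketch below groups text fragments arriving from a model into sentence-sized chunks and hands each chunk to a synthesizer as soon as it is complete. The names sentence_chunks, stream_speech, and fake_synthesize are hypothetical and do not belong to any particular TTS library; the synthesizer is a stand-in that returns bytes, and the sentence splitting is deliberately naive (it would break on abbreviations such as "e.g.").

```python
from typing import Callable, Iterable, Iterator

# Naive sentence boundaries; an assumption made for this sketch only.
SENTENCE_END = (".", "!", "?")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed text fragments into sentence-sized chunks."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Emit as soon as the buffered text ends a sentence.
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush any trailing text that never reached a sentence boundary.
    if buffer.strip():
        yield buffer.strip()


def stream_speech(
    tokens: Iterable[str],
    synthesize: Callable[[str], bytes],
) -> Iterator[bytes]:
    """Yield audio for each sentence without waiting for the full response."""
    for sentence in sentence_chunks(tokens):
        yield synthesize(sentence)


def fake_synthesize(text: str) -> bytes:
    """Hypothetical stand-in for a real TTS call; returns the text as bytes."""
    return text.encode("utf-8")


if __name__ == "__main__":
    model_output = ["Hello", " there", ".", " How", " are", " you", "?"]
    for audio in stream_speech(model_output, fake_synthesize):
        print(f"got {len(audio)} bytes of audio")
```

Because both functions are generators, the first chunk of audio is produced after the first sentence arrives, while the rest of the response is still being generated. A real system would likely replace fake_synthesize with a call to a TTS engine and play each chunk as it arrives.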

Related terms

STT
Voice Cloning