People are very sensitive to the quality of the speech they hear (Bailly, 2003). High-quality conversational IVR applications primarily use recordings of professional voice talents for the system voice, sometimes supplemented with artificial speech (text-to-speech, or TTS) for unbounded text, that is, text that is difficult or impossible to predict in advance (e.g., new book or movie titles). Lower-cost conversational systems, such as those in vehicles or on mobile devices, may rely exclusively on TTS. Research on a standardized assessment questionnaire, the MOS-X (Polkosky & Lewis, 2003), has identified four components of user satisfaction with speech output: Intelligibility, Naturalness, Prosody, and Social Impression.

Voice Talent



Audio Recording Considerations


