People are very sensitive to the quality of the speech they hear (Bailly, 2003). High-quality conversational IVR applications primarily use recordings of professional voice talent for the system voice, sometimes supplemented with synthetic speech (text-to-speech, or TTS) for unbounded text, that is, text that is difficult or impossible to predict (e.g., new book or movie titles). Lower-cost conversational systems, such as those in in-vehicle or mobile devices, may rely exclusively on TTS. Research on a standardized assessment questionnaire, the MOS-X (Polkosky & Lewis, 2003), indicates four components of user satisfaction with speech output: Intelligibility, Naturalness, Prosody, and Social Impression.
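
The split between recorded prompts and TTS for unbounded text can be illustrated with a minimal sketch. It assumes a hypothetical catalogue of recorded phrases (`RECORDED_PROMPTS`) and a hypothetical `plan_prompt` helper; the file names and phrases are placeholders, not part of any particular IVR platform. Each part of a prompt is mapped to a recording when one exists and otherwise marked for TTS.

```python
from dataclasses import dataclass

# Hypothetical catalogue: phrases recorded by the voice talent,
# mapped to placeholder audio file names.
RECORDED_PROMPTS = {
    "Now playing": "now_playing.wav",
    "by": "by.wav",
}


@dataclass
class Segment:
    kind: str   # "audio" for a recording, "tts" for synthesized speech
    value: str  # file name for audio, raw text for TTS


def plan_prompt(parts):
    """Map each text part to a recording when one exists, else to TTS."""
    segments = []
    for text in parts:
        recording = RECORDED_PROMPTS.get(text)
        if recording is not None:
            segments.append(Segment("audio", recording))
        else:
            # Unbounded text (e.g., a new title) has no recording.
            segments.append(Segment("tts", text))
    return segments


if __name__ == "__main__":
    for segment in plan_prompt(["Now playing", "Some New Title"]):
        print(segment.kind, segment.value)
```

A real application would hand the resulting segments to the platform's own audio and TTS playback; the sketch only shows the selection logic.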

Voice Talent

Prosody

TTS

Audio Recording Considerations

References

Bailly, G. (2003). Close shadowing natural versus synthetic speech. International Journal of Speech Technology, 6, 11–19.

Polkosky, M. D., & Lewis, J. R. (2003). Expanding the MOS: Development and psychometric evaluation of the MOS-R and MOS-X. International Journal of Speech Technology, 6, 161–182.