Definition

In linguistics, prosody (pronounced /ˈprɒsədi/ pross-ə-dee, from Greek προσῳδία, prosōidía, [prosɔːdía], “song sung to music; pronunciation of syllable”) is the rhythm, stress, and intonation of speech. Prosody may reflect various features of the speaker or the utterance: the emotional state of the speaker; the form of the utterance (statement, question, or command); the presence of irony or sarcasm; emphasis, contrast, and focus; or other elements of language that may not be encoded by grammar or choice of vocabulary.

Turn-taking considerations

Time pauses to encourage turn-taking at the appropriate times
Pauses are the "white space" of auditory design and guide turn-taking. "The right word may be effective, but no word was ever as effective as a rightly timed pause" (M. Twain).

Effective turn-taking drives effective conversation. In many cultures, an important cue that one person in a dialog has finished talking and expects a response from the other conversant is for that person to stop talking (Balentine, 2006; Beattie & Barnard, 1979; Heins et al., 1997; Johnstone et al., 1994; Margulies, 2005; Roberts et al., 2006; Stivers et al., 2009; Wilson & Zimmerman, 1986). That isn’t to say that one speaker never barges into another, but if someone who has been talking stops and waits, it’s pretty clear that they’ve given up the conversational floor. Research on the timing of turn-taking during American English service-based conversations over the phone indicates that pauses shorter than 250 ms will rarely trigger turn-taking, whereas pauses longer than 1300 ms will almost certainly induce the non-speaking conversant to take the floor (Beattie & Barnard, 1979; Commarford & Lewis, 2005). If this does not happen, it is indicative of a conversational problem requiring repair (Roberts et al., 2006).
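
The two thresholds reported above can be read as a rough three-way classification of a silence. The following is a minimal Python sketch of that reading; the function name and the label strings are illustrative assumptions, and the cut-offs simply restate the figures cited for American English service calls.

```python
# Thresholds restate the figures cited above (Beattie & Barnard, 1979;
# Commarford & Lewis, 2005). Names and labels are illustrative only.
RARELY_TRIGGERS_BELOW_MS = 250
ALMOST_CERTAIN_ABOVE_MS = 1300


def classify_pause(duration_ms: int) -> str:
    """Describe how likely a silence of this length is to hand over the floor."""
    if duration_ms < 0:
        raise ValueError("duration_ms must be non-negative")
    if duration_ms < RARELY_TRIGGERS_BELOW_MS:
        return "rarely triggers turn-taking"
    if duration_ms > ALMOST_CERTAIN_ABOVE_MS:
        return "almost certainly triggers turn-taking"
    return "may trigger turn-taking"


for ms in (100, 750, 1500):
    print(ms, "ms:", classify_pause(ms))
```

The middle band is deliberately left vague, since the cited research only characterizes the two extremes.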

Turn-taking pauses should be about 750 ms
Normally, a system should wait until the user has paused for about 750 ms before taking the floor. A pause shorter than that will not lead most callers to take their turn. Callers who are familiar with a system will respond to shorter pauses, and may in fact barge in over a prompt that has no pause at all. When cognitive load is high or the caller is unsure how to proceed, 750 ms will not be enough. But it's a good place to start.

Wait about 2 seconds before extending the dialog
Sometimes the intent is to provide a "standard" prompt, pause to give the caller time to answer, and then give them an "extra option" (e.g., “Or you can say ‘none of these.’”). In this case, there needs to be enough time for callers to react to the standard prompt. In general, about 2 sec (± 0.5 sec) is appropriate before attempting to repair the conversation by extending the dialog in this manner. In some instances, you may want to wait longer to encourage a choice from the initial prompt rather than the add-on option (e.g., "Or if none of these sound right, say 'agent.'").

Set standard no input timeouts to 3-7 seconds
For no input events (the system has stopped talking and is waiting for the user to pick up the conversation), the VoiceXML default of 7 seconds seems to work well in practice; shortening it to as little as 5 or 3 seconds also appears to work (Margulies, 2005; Yuschik, 2008). For special populations (non-native speakers, older adults) or tasks (getting a credit card number, performing steps to activate a cell phone), it’s reasonable to provide a longer timeout, anywhere from 10 to 30 seconds (Dulude, 2002), or to provide a pause/resume capability.
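
One way to keep these ranges from drifting during tuning is to clamp any requested timeout into the recommended band. The sketch below is a hedged illustration in Python, not a platform API: the class, the constants, and the function name are all assumptions, and the numeric ranges only restate the guidance above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeoutRange:
    """Inclusive range of no-input timeouts, in seconds."""
    low_s: float
    high_s: float

    def clamp(self, requested_s: float) -> float:
        """Pull a requested value back inside the range."""
        return max(self.low_s, min(self.high_s, requested_s))


# Ranges restate the guidance above; the names are illustrative.
STANDARD = TimeoutRange(3.0, 7.0)
EXTENDED = TimeoutRange(10.0, 30.0)  # special populations or demanding tasks


def no_input_timeout(requested_s: float, extended: bool = False) -> float:
    """Clamp a requested no-input timeout into the recommended range."""
    return (EXTENDED if extended else STANDARD).clamp(requested_s)


print(no_input_timeout(7.0))                 # 7.0
print(no_input_timeout(1.0))                 # 3.0
print(no_input_timeout(7.0, extended=True))  # 10.0
```

Whether a given dialog state counts as "extended" is a design decision the sketch leaves to the caller.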

Pauses within prompts

As discussed in the section above, the primary use of pauses is to facilitate turn-taking.

A second use is to give the listener time to process what the system has said. This is particularly useful in a menu. In the section about menus, we discussed that the caller is not expected to memorize the entries on the menu, simply to select the right one. At each item, they process the item and classify it as reject, select, or maybe. That processing takes time.

Pause at least 500 ms between options in a menu
Pauses should be inserted between menu items to allow for that processing. 500 ms between options is a good place to start. Menu items that are super short, clear, and distinct may not require 500 ms in between. Anecdotal evidence suggests that menu items that are longer and more complex may take more time, say 750-1000 ms. A controlled study with the same menus executed with 250, 500, 750, and 1000 ms pauses (McKienzie, 2009) showed that 250 ms was definitely too short (more error conditions) but that the other three were more or less a toss-up. A DTMF system may actually require more processing time, because for each menu item the caller has to store both the description of the choice and its corresponding number.

These pauses should be done as post-processing of the recorded prompts. Let the voice talent speak with natural pauses (which are often 150-250 ms), then insert extra silence to achieve the desired pause duration.
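
The post-processing arithmetic is simple: the silence to add is the target pause minus the pause already present in the recording. The Python sketch below illustrates it under stated assumptions: the audio is mono, signed 16-bit linear PCM at 8 kHz (where silence is all-zero samples), and both function names are hypothetical. Other encodings, such as mu-law, represent silence differently and would need a different fill value.

```python
def extra_silence_ms(target_pause_ms: int, natural_pause_ms: int) -> int:
    """Silence to add so that natural pause plus padding reaches the target."""
    return max(0, target_pause_ms - natural_pause_ms)


def silence_bytes(duration_ms: int, sample_rate_hz: int = 8000,
                  sample_width_bytes: int = 2) -> bytes:
    """Zero-valued mono linear PCM of the given duration (assumed format)."""
    n_samples = sample_rate_hz * duration_ms // 1000
    return b"\x00" * (n_samples * sample_width_bytes)


# A 200 ms natural pause padded out to a 500 ms menu pause.
padding_ms = extra_silence_ms(500, 200)
padding = silence_bytes(padding_ms)
print(padding_ms, "ms of padding =", len(padding), "bytes")
```

The padding bytes would be appended to the raw sample data of the preceding menu item before the file is saved.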

Concatenation

Record the biggest chunks of audio you can as standalone messages
As much as you can, plan for segments that are complete phrases. This will sound the most natural in terms of pausing and coarticulation (see below). Storage space for more messages isn't generally a problem, and once you're doing a recording session, extra prompts are generally not that costly. Concatenation should be reserved for situations that are highly dynamic.

Consider a banking application doing a funds transfer. To confirm the transfer, you might have a sentence like this:

  • System: To confirm, you want to transfer $500 from savings to checking.

Obviously, you'll have to concatenate the $500 into the middle. But the end part could probably be recorded as a bunch of pairs rather than individual account types. Say you have savings, checking, and money market. Such systems have been built with the end piece being exactly 5 messages: "from," "to," "checking," "savings," and "money market."

You could instead record 6 messages like this, which sounds much better: "from checking," "from savings," "from money market," "to checking," "to savings," "to money market." This way you can have flat intonation on the "from" part and falling intonation on the "to" part.

But why stop there? And really, at three account types, there's no reason to. This is also six messages: "from checking to savings," "from checking to money market," "from savings to checking," "from savings to money market," "from money market to checking," "from money market to savings."
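
The list of whole-phrase recordings is every ordered pair of distinct accounts, so it can be generated rather than typed by hand. This is a small Python sketch; the function name is an assumption. The count grows as n × (n − 1), which is 6 for three account types and 12 for four, so the approach becomes less attractive as the number of account types rises.

```python
from itertools import permutations
from typing import List

ACCOUNTS = ["checking", "savings", "money market"]


def account_pair_prompts(accounts: List[str]) -> List[str]:
    """Every 'from X to Y' phrase for distinct source and target accounts."""
    return [f"from {src} to {dst}" for src, dst in permutations(accounts, 2)]


prompts = account_pair_prompts(ACCOUNTS)
for text in prompts:
    print(text)
print(len(prompts), "recordings")
```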

We now have our sentence in three pieces: lead, dollar amount, accounts. You might even want to take it a little further. What if the most common transfer amounts were $50, $100, $500, and $1000? Record the four of those with the lead and only concatenate in the amount if it's something different than that. You get the idea. Use what you know about the usage of the system to help determine where the breaks in recordings have to go.
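
A hedged sketch of that selection logic follows, in Python. All prompt identifiers are hypothetical, and the "amount:" entry stands in for whatever number-playback routine the platform provides; only the four common amounts come from the example above.

```python
from typing import List

# Amounts recorded together with the lead, per the example above.
COMMON_AMOUNTS = {50, 100, 500, 1000}


def confirmation_segments(amount_dollars: int, source: str, target: str) -> List[str]:
    """Pick the fewest recordings that can speak the confirmation sentence."""
    accounts = f"from_{source}_to_{target}".replace(" ", "_")
    if amount_dollars in COMMON_AMOUNTS:
        # Lead and amount were recorded as one natural phrase.
        return [f"confirm_transfer_{amount_dollars}", accounts]
    # Otherwise fall back to concatenating the amount into the middle.
    return ["confirm_transfer_lead", f"amount:{amount_dollars}", accounts]


print(confirmation_segments(500, "savings", "checking"))
print(confirmation_segments(275, "money market", "checking"))
```

In a real system the set of common amounts would come from usage data rather than a hard-coded constant.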

If you need to play back times, avoid having separate audio segments for "at," "one," "o'clock," "p," and "m" -- instead, consider a plan like "at one," "o'clock," "p.m." -- then for times not on the hour, "at one," "fifty-five," "p.m."
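
The three-segment plan can be sketched as a function that maps a time of day to prompt identifiers. This is an illustrative Python sketch only: the identifiers ("at_1", "oclock", "minute_55", and so on) are hypothetical names for recordings, not part of any real prompt library.

```python
from typing import List


def time_segments(hour_24: int, minute: int) -> List[str]:
    """Return hypothetical prompt IDs for speaking a time of day."""
    if not (0 <= hour_24 <= 23 and 0 <= minute <= 59):
        raise ValueError("time out of range")
    hour_12 = hour_24 % 12 or 12              # 0 -> 12, 13 -> 1
    meridiem = "am" if hour_24 < 12 else "pm"
    # On the hour, play "o'clock"; otherwise play the minutes recording.
    middle = "oclock" if minute == 0 else f"minute_{minute:02d}"
    return [f"at_{hour_12}", middle, meridiem]


print(time_segments(13, 0))   # ['at_1', 'oclock', 'pm']
print(time_segments(13, 55))  # ['at_1', 'minute_55', 'pm']
```

Keeping "at" attached to the hour means the preposition and the number are always recorded together, so the coarticulation between them is preserved.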

If your system plays back time of day (or something else that is generally concatenated) a lot, it might even be worth investing a little more up front to record individual times all at once.

Also, avoid chunking that separates articles such as “a,” “an,” and “the” from the following word (or any other combination that speakers normally run together or for which the correct article depends on the following word).

Drive concatenation decisions with coarticulation considerations
Why can't you just record each word that you plan to use in an application, then just join them (concatenate) as needed? Because coarticulation is the enemy of concatenation.

In natural, continuous speech, the tongue and lips (articulators) approach but do not reach the final positions necessary to produce "perfect" speech. If the articulators did reach their target positions, however, the resulting speech would sound unnatural (hyperarticulated) and would be much slower than natural speech. For this reason, the actual sound of any given phoneme depends on the phonemes that surround it, resulting in coarticulation. One of the amazing processes of language is how our brains unravel this and hear phonemes as discrete categories of sound (Liberman et al., 1957). A consequence of how our brains have evolved to untangle coarticulation is that when snippets of audio that were recorded in different contexts are concatenated, they sound unnatural and jarring. An understanding of the concept of coarticulation is very important when planning the recorded output of a speech application.

Use natural pause points to minimize coarticulation effects
Coarticulation does not persist over pauses, so you can use natural pause points to help define appropriate audio segments. But don't forget about inflection. In the account example above, even a segment bounded by a pause can require different inflection depending on where it falls in the sentence.

How to record legal/non-barge-in-able messaging

Designers rarely have the only say - or sometimes any say - when it comes to legal messages. Therefore, all of these recommendations come with a caveat of "as much as possible."

Avoid playing legal messages
Rarely of interest to the callers who must wait through them, these types of messages are one of the banes of IVRs from a caller experience perspective. For this reason, avoid designing them into the IVR unless you must for legal reasons or due to some other compelling business rule (see What Not to Include at the Beginning).

Let callers barge in over legal messages
This at least gives callers some control over whether to listen to the message. For example, think of online license agreements -- in most cases, users can easily skip over them without reading every word, so why would you require callers to listen to every word if doing the same or a similar task over the phone?

Keep legal messages short
To avoid having to run a shorter message through its legal department, an enterprise might fill the first design of its IVR with legal messaging copied from the Web or some other written source. The written version might not be too bad to read, but might be excruciating to have to listen to in its entirety. If confronted with this situation, write a more concise version, then have your stakeholders listen to each version being read aloud. After exposure to the reality of their proposed caller experience, they may be more willing to submit a shorter version for legal review.

Consider recording legal messages and disclaimers in a "radio style"
For example, if required to play a "This call may be monitored or recorded" message up front (not a recommended practice -- again, see What Not to Include at the Beginning), consider having it recorded quickly at a lower volume -- like the disclaimer messages associated with radio advertisements.

References

Balentine, B. (2006). The power of the pause. In W. Meisel (Ed.), VUI Visions: Expert Views on Effective Voice User Interface Design (pp. 89-91). Victoria, Canada: TMA Associates.

Beattie, G. W., & Barnard, P. J. (1979). The temporal structure of natural telephone conversations (directory enquiry calls). Linguistics, 17, 213–229.

Commarford, P. M., & Lewis, J. R. (2005). Optimizing the pause length before presentation of global navigation commands. In Proceedings of HCI International 2005: Volume 2—The management of information: E-business, the Web, and mobile computing (pp. 1–7). St. Louis, MO: Mira Digital Publication.

Dulude, L. (2002). Automated telephone answering systems and aging. Behaviour and Information Technology, 21(3), 171–184.

Heins, R., Franzke, M., Durian, M., & Bayya, A. (1997). Turn-taking as a design principle for barge-in in spoken language systems. International Journal of Speech Technology, 2, 155-164.

Johnstone, A., Berry, U., Nguyen, T., & Asper, A. (1994). There was a long pause: Influencing turn-taking behaviour in human-human and human-computer spoken dialogues. International Journal of Human-Computer Studies, 41, 383–411.

Liberman, A. M., Harris, K. S., Hoffman, H. S., & Griffith, B. C. (1957). The discrimination of speech sounds within and across phoneme boundaries. Journal of Experimental Psychology, 54, 358–368.

Margulies, E. (2005). Adventures in turn-taking: Notes on success and failure in turn cue coupling. In AVIOS 2005 proceedings (pp. 1–10). San Jose, CA: AVIOS.

McKienzie, J. (2009). Menu pauses: How long? [PowerPoint Slides]. Paper presented at SpeechTek 2009. New York, NY: SpeechTek.

Roberts, F., Francis, A. L., & Morgan, M. (2006). The interaction of inter-turn silence with prosodic cues in listener perceptions of “trouble” in conversation. Speech Communication, 48, 1079–1093.

Stivers, T., Enfield, N. J., Brown, P., Englert, C., Hayashi, M., Heinemann, T., Hoymann, G., Rossano, F., de Ruiter, J. P., Yoon, K.-E., & Levinson, S. C. (2009). Universals and cultural variation in turn-taking in conversation. Proceedings of the National Academy of Sciences, 106(26), 10587–10592.

Wilson, T. P., & Zimmerman, D. H. (1986). The structure of silence between turns in two-party conversation. Discourse Processes, 9, 375–390.

Yuschik, M. (2008). Silence locations and durations in dialog management. In D. Gardner-Bonneau & H. E. Blanchard (Eds.), Human factors and voice interactive systems, 2nd edition (pp. 231-253). New York, NY: Springer.