Don’t give the caller an upfront choice
This is a specific instance of the general guideline to avoid cluttering the beginning of a call. For more information, see What Not to Include at the Beginning, specifically, the subsection entitled "Prompts for Touchtone versus Speech."

Switching
After repeated speech errors, some systems will switch the rest of the call to DTMF. This takes some upfront planning and work to implement the dialog steps in both speech and DTMF, but, as described in more detail below in the section on Error Recovery, allows callers to continue self-routing and self-service when speech doesn't work well. This strategy can keep callers in the system who would otherwise get transferred to a customer service representative, whether or not the caller wanted to transfer. In making the decision about what to do when speech fails, companies need to balance the relative costs and benefits of providing a touchtone fallback versus faster transfer to agents. "For example, enterprises that face significant competition and with customers who can easily change providers might choose a strategy of rapid transfer over touchtone fallback" (Lewis, 2011, p. 186).

Promote Where Appropriate
Provide a touchtone alternative for tasks that are easier for DTMF than speech. The best example of this is the entry of numeric strings, such as PIN codes, account numbers, social security numbers, etc. These types of prompts usually start with the phrase "Say or enter ...," which is a concise way to let callers know they have a choice (and one with which callers tend to be familiar). Analyses of actual usage of these types of prompts show that most callers choose to use the keypad (60% reported by Suhm, 2008; 90% reported by Attwater, 2008).

Another approach is to use mode-neutral prompts such as, "Next, what's your PIN?" Ideally, the application should accept either speech or DTMF for this prompt. This is not as effective as "say or enter," especially if it follows several speech prompts. Callers may not realize that they can use DTMF.

Error Recovery: Use DTMF as an alternate input style for callers who have trouble using speech

It’s always a good idea to allow users to fall back to DTMF input when they are experiencing multiple errors with speech. No matter how robust a system’s grammars are, there will be times when the user’s environment (a barking dog, a loud television, competing voices) will make speech recognition extremely challenging. If possible, consider turning off the recognizer and switching to a DTMF-only mode if the system reaches a predetermined No Match threshold.
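The threshold idea above can be sketched in a few lines. This is a minimal illustration, not a platform API: the class name, the mode labels, the default threshold of 3, and the choice to reset the counter after a successful recognition are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class InputModeTracker:
    """Counts No Match events and switches to DTMF-only at a threshold.

    All names and the default threshold are hypothetical.
    """
    no_match_threshold: int = 3   # assumed value; tune per application
    no_match_count: int = 0
    mode: str = "speech"          # "speech" or "dtmf_only"

    def record_no_match(self) -> str:
        """Register one No Match event and return the resulting mode."""
        if self.mode == "speech":
            self.no_match_count += 1
            if self.no_match_count >= self.no_match_threshold:
                # Threshold reached: stay in DTMF-only mode from here on.
                self.mode = "dtmf_only"
        return self.mode

    def record_success(self) -> None:
        """Assumption: a successful recognition resets the counter."""
        if self.mode == "speech":
            self.no_match_count = 0
```

In this sketch the switch is one-way: once the tracker reaches "dtmf_only" it stays there for the rest of the call.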

When callers have difficulty with a speech recognition IVR, it is a common practice to allow callers to use DTMF (the touchtone keypad) as an alternative input mechanism (Balentine, 2010; Rolandi, 2004a; Suhm, 2008). Callers might have problems using speech for reasons such as:
  • Caller accent
  • Speech disability
  • High ambient noise
  • Desire to quietly enter sensitive information (e.g., a PIN)

Use an established approach to mixed-mode prompting
There are two fundamentally different approaches to mixed-mode (speech plus touchtone) prompting, both in current use:
  • Keep speech enabled for the entire dialog step (which allows a reprompt like, "Say Checking (or press 1), Savings (2), or Money Market (3)").
  • Disable speech when switching to touchtone (at which time a reprompt would be DTMF-only, such as "For Checking, press 1. Savings, 2. Or Money Market, 3").

To determine which approach is better for a given application, it's necessary to know what kind of problems users are likely to have.

If the primary problem is caller accent or disfluency, then there is no need to disable speech.

If the primary problem is ambient noise, then it may be necessary to disable speech so that background noise does not trigger barge-in and cut off the system's prompts and messages before the caller hears them.

Research continues in making speech recognition more robust in the face of ambient noise, but this is still a significant problem for current technologies (Karray & Martin, 2003), just as it is for human-human dialogs (McKellin et al., 2007).

"Because disabling speech barge-in can cause significant usability problems with speech input, it is better to disable speech input altogether and to stay in touchtone mode following multiple speech failures [in the condition of high ambient noise], still permitting touchtone barge-in" (Lewis, 2011, p. 186).

If disabling speech and switching to DTMF-only for the rest of the session, consider playing a concise prompt such as, "Let's use the keypad from here on."
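The decision described in this section can be written down as a small lookup. This is a sketch only: the problem labels, the FallbackPlan fields, and the function name are hypothetical, and permitting touchtone barge-in in the accent/disfluency case is an assumption (the source states it only for the high-noise case).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FallbackPlan:
    speech_enabled: bool          # keep the recognizer on for the dialog step?
    dtmf_barge_in: bool           # may key presses interrupt the prompt?
    announcement: Optional[str]   # one-time message played on switching


def choose_fallback(primary_problem: str) -> FallbackPlan:
    """Map the likely caller problem to a mixed-mode strategy (hypothetical labels)."""
    if primary_problem in ("accent", "disfluency"):
        # No need to disable speech; offer "say X or press N" reprompts.
        return FallbackPlan(speech_enabled=True, dtmf_barge_in=True,
                            announcement=None)
    if primary_problem == "ambient_noise":
        # Disable speech input altogether, still permitting touchtone barge-in.
        return FallbackPlan(speech_enabled=False, dtmf_barge_in=True,
                            announcement="Let's use the keypad from here on.")
    raise ValueError(f"unknown problem type: {primary_problem!r}")
```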

Put the "press X" after the option
Because of the end-focus principle, in the vast majority of cases you want to put the action after the target. Thus:

  • For A, press 1. For B, press 2.

This holds for both initial prompts in DTMF-only situations, and for reprompts in speech systems that use DTMF as backup. In the latter case, it'll look more like this:

  • Say A or press 1.

There is some evidence that for longer lists of very short, familiar, mutually exclusive options, it works to put the number first. For example:

  • Press 1 for Yes, 2 for No.

Callers know exactly what they have and are fine with this because they know the choices. If it’s a list of new or otherwise unknown choices, it won’t work to put numbers first.
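The three prompt shapes above are easy to generate from a list of (key, label) pairs. The helper names below are hypothetical and the code only illustrates the word order; it is not a recommendation for how to build prompts in any particular platform.

```python
def target_first_prompt(options):
    """Action after the target: 'For A, press 1. For B, press 2.'"""
    return " ".join(f"For {label}, press {key}." for key, label in options)


def say_or_press_prompt(options):
    """Reprompt for a speech system with DTMF backup: 'Say A or press 1.'"""
    return " ".join(f"Say {label} or press {key}." for key, label in options)


def number_first_prompt(options):
    """Number first, for very short, familiar options: 'Press 1 for Yes, 2 for No.'"""
    parts = [f"{key} for {label}" for key, label in options]
    return "Press " + ", ".join(parts) + "."


print(target_first_prompt([(1, "A"), (2, "B")]))
print(say_or_press_prompt([(1, "A")]))
print(number_first_prompt([(1, "Yes"), (2, "No")]))
```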

Avoid "Press or say <x>"
Avoid using prompts like, "For checking, press or say 1. For savings, press or say 2. For money market, press or say 3."

This is a common (but far from leading) practice. Lewis (2007) searched the Web and found numerous examples of applications using this prompting style, but no examples of designers promoting it. This type of prompting inherits the weaknesses of DTMF-only applications (needing to remember a number instead of the desired function) and speech applications (possibility of misrecognition), without getting any of the advantages of well-designed speech applications. Designers have argued against this practice for over ten years:

"Asking the user to speak a digit for menu selections or other non-numeric data -- simply to emulate the touch-tone keypad -- is extremely awkward. Although speech recognition technologies of several years ago were limited to such vocabularies, this is no longer the case" (Balentine & Morgan, 2001).

Specify Modality in Prompts
To minimize caller errors, the wording of the prompt should promote the input modality that you want from the caller. For example, if you want speech, then ask the caller to "Say A." If you want DTMF, then ask the caller to "Press N."

If you will accept both modalities, then it is preferable to promote only one input modality in the main menu and offer the other input modality only if the caller seems to be having trouble with the primary modality. Note that this recommendation is for menus. For other forms of input (like account numbers), an explicit "say or enter" is best.

Here's an example of an initial prompt:

  • Which account do you want to transfer funds from? You can say checking, savings, or money market.

In the event of an error, the reprompt might look like this:

  • Tell me which account to transfer the funds from. You can say checking or press 1. Say savings or press 2. Last, say money market or press 3.

Guidelines elsewhere talk about under what circumstances you'd want to taper the say/press instructions (e.g., more concisely: "You can say checking or press 1; savings, 2; or money market, 3.").
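As a sketch of how the initial prompt, the error reprompt, and the tapered reprompt relate, the following helpers rebuild the three example prompts from one list of option labels. The function names are hypothetical, and keys are assumed to be assigned 1, 2, 3 in list order.

```python
def speech_list(labels):
    """Join labels as 'checking, savings, or money market'."""
    if len(labels) == 1:
        return labels[0]
    if len(labels) == 2:
        return f"{labels[0]} or {labels[1]}"
    return ", ".join(labels[:-1]) + f", or {labels[-1]}"


def initial_prompt(question, labels):
    """Speech-only initial prompt: promote one modality."""
    return f"{question} You can say {speech_list(labels)}."


def full_reprompt(instruction, labels):
    """Error reprompt that offers both 'say' and 'press' for every option."""
    parts = []
    for key, label in enumerate(labels, start=1):
        if key == 1:
            parts.append(f"You can say {label} or press {key}.")
        elif key == len(labels):
            parts.append(f"Last, say {label} or press {key}.")
        else:
            parts.append(f"Say {label} or press {key}.")
    return f"{instruction} " + " ".join(parts)


def tapered_reprompt(labels):
    """Tapered form: full say/press wording for the first option only."""
    first = f"You can say {labels[0]} or press 1"
    rest = [f"{label}, {key}" for key, label in enumerate(labels[1:], start=2)]
    if rest:
        rest[-1] = "or " + rest[-1]
    return "; ".join([first] + rest) + "."


accounts = ["checking", "savings", "money market"]
print(initial_prompt("Which account do you want to transfer funds from?", accounts))
print(full_reprompt("Tell me which account to transfer the funds from.", accounts))
print(tapered_reprompt(accounts))
```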

Mixing
Make it clear to the user whether the system is expecting speech or DTMF input, and keep the input type consistent throughout the interaction whenever possible. Switching between speech and DTMF in the absence of an error condition can be confusing, leading to indecision and silence from the user, which in turn can lead to an error condition. If faced with a need to mix modalities (e.g., a choice from a speech-enabled main menu directs the call to a touchtone subroutine), you might want to let the caller know what's going on with additional (hopefully concise) messaging. Note that a caller who has been transferred from a speech to a touchtone application may well understand what's happening without any additional messaging or prompting, just by virtue of how touchtone prompts are designed, in which case there is no need for any additional messaging. When designing, put yourself in the caller's shoes, work out sample dialogs, and explore whether additional messaging would potentially benefit callers or just be superfluous. Also, be sure to include tasks in your usability testing that will help you determine how well your design is working.

DTMF-Only Menus and Applications
There are times when you want to create a pure DTMF (touchtone) application. There are many reasons for this: an international caller base where even if everybody knows the same language, it's not their first language; an application targeted to an environment that is known to be noisy; a simple task or menu where speech doesn't buy you anything; or the Powers That Be decreed it so.

Whatever the reason, most design principles carry over, but there are a few differences. Most significantly, with touchtone applications, you as a designer only have the phone keypad to work with, which limits interaction to selection from a menu or numeric input. It is possible to devise schemes for entering text, but these methods tend to be difficult to use because it's necessary to either disambiguate which letter a caller intends for each entered character (e.g., using multikey or multitap -- see Lewis et al., 1997, pp. 1292-1293 or Lewis et al., 2008, p. 113) or to disambiguate the entered string with some sort of database look-up. This places significant limitations on the kinds of tasks that callers can perform with a DTMF-only application.

If a broad menu goes beyond 9 items, necessitating double digits, it is better to split the menu into two menus. This will lead to a cleaner design and eliminate the risk of callers accidentally going down the wrong logic leg of a call flow due to any substantial latency between the first and second digits.

An exception would be an application that is used solely by power users, such as an internal system, where callers are very familiar with the system and will appreciate the speed of a single-layer menu.
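One way to keep every menu within single digits is sketched below. The guideline only says to split the menu; how the caller reaches the second menu is an assumption here (the last key of the first menu is given to a hypothetical "more options" entry), as are the function and parameter names.

```python
def split_menu(labels, max_keys=9, more_label="more options"):
    """Split a long option list into menus that need only keys 1-9.

    Assumption: each menu that overflows reserves its last key for a
    'more options' entry that leads to the next menu.
    """
    if len(labels) <= max_keys:
        return [list(labels)]
    first = list(labels[:max_keys - 1]) + [more_label]
    return [first] + split_menu(labels[max_keys - 1:], max_keys, more_label)


menus = split_menu([f"option {n}" for n in range(1, 13)])
for menu in menus:
    print(menu)
```

A real design would more likely group the options by meaning rather than by position in the list; the sketch only shows the single-digit constraint.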

Special keys (i.e., star, pound, zero) should not be assigned to menu options. Zero should never be used for anything other than operator. Star and pound are often reserved for universal navigation, and even if they are not in your system, assigning menu options to them is confusing to callers.

If the IVR will make use of dynamic menus (only a subset of menu options will be offered based on some characteristic of the caller), then it is better to renumber the DTMF options of the menu (e.g., 1-2-3-4) rather than leave gaps (e.g., 1-2-4). However, it should be noted that renumbering menus can have negative effects if the application has a large base of power users who rapidly navigate the system from memory.
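A minimal sketch of the renumbering choice follows. The function name and the stable_keys flag are hypothetical; the flag simply models the power-user case, where keeping the original key assignments (and accepting gaps) may matter more than a clean sequence.

```python
def build_dynamic_menu(all_options, available, stable_keys=False):
    """Return a {key: option} map for the options offered to this caller.

    all_options: the full menu in its canonical order.
    available:   the subset this caller is eligible for.
    stable_keys: if True, keep each option's original key (gaps possible);
                 if False, renumber the offered options 1, 2, 3, ...
    """
    numbered = list(enumerate(all_options, start=1))
    offered = [(key, opt) for key, opt in numbered if opt in available]
    if stable_keys:
        return dict(offered)
    return {new_key: opt for new_key, (_, opt) in enumerate(offered, start=1)}


full_menu = ["checking", "savings", "loans", "cards"]
eligible = {"checking", "savings", "cards"}
print(build_dynamic_menu(full_menu, eligible))
print(build_dynamic_menu(full_menu, eligible, stable_keys=True))
```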

Applying the principle of tapering, respect for the caller's time, and the conversational maxim of quantity (see Conversational Maxims), strongly consider playing "press" for just the first option in a menu and, depending on how it sounds, perhaps the last. Callers know how to use touchtone menus: when they hear a number associated with an option, they know to press that key if they want that option (for example, "For checking, press 1. Savings, 2. Money market, 3. And for anything else, press 4."). Dropping the extra instances of "press" leads to prompts that take less time to play, are just as easy to use, and sound fresher than prompts with the robotic repetition of "press". Note that this strategy might not work well when the user's choices include numbers -- if that's the case, you'll need to include "press" before each DTMF number.
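The tapering pattern can be sketched as a small prompt builder. The function name and the press_on_last flag are hypothetical, and the check for numbers is a crude assumption: it only looks for digit characters in the labels, so spelled-out numbers such as "twelve" would not be caught.

```python
def tapered_dtmf_menu(labels, press_on_last=True):
    """Build a DTMF menu prompt that says 'press' only where needed.

    Assumes non-empty labels. If any label contains a digit, 'press' is
    kept before every key so the option and the key are not confused.
    """
    needs_press_everywhere = any(ch.isdigit() for label in labels for ch in label)
    last = len(labels)
    parts = []
    for key, label in enumerate(labels, start=1):
        if key == 1 or needs_press_everywhere:
            parts.append(f"For {label}, press {key}.")
        elif key == last and press_on_last:
            parts.append(f"And for {label}, press {key}.")
        else:
            # Tapered form: capitalized label followed by the bare key.
            parts.append(f"{label[0].upper() + label[1:]}, {key}.")
    return " ".join(parts)


print(tapered_dtmf_menu(["checking", "savings", "money market", "anything else"]))
print(tapered_dtmf_menu(["a 12 month term", "a 24 month term"]))
```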

References

Attwater, D. (2008). Speech and touch-tone in harmony [PowerPoint Slides]. Paper presented at SpeechTek 2008. New York, NY: SpeechTek.

Balentine, B. (2010). Next-generation IVR avoids first-generation user interface mistakes. In W. Meisel (Ed.), Speech in the user interface: Lessons from experience (pp. 71–74). Victoria, Canada: TMA Associates.

Balentine, B., & Morgan, D. P. (2001). How to build a speech recognition application: A style guide for telephony dialogues, 2nd edition. San Ramon, CA: EIG Press.

Karray, L., & Martin, A. (2003). Toward improving speech detection robustness for speech recognition in adverse conditions. Speech Communication, 40, 261–276.

Lewis, J. R. (2011). Practical speech user interface design. Boca Raton, FL: CRC Press, Taylor & Francis Group.

Lewis, J. R. (2007). Advantages and disadvantages of press or say <x> speech user interfaces (Tech. Rep. BCR-UX-2007-0002). Boca Raton, FL: IBM Corp. Available at http://drjim.0catch.com/2007_AdvantagesAndDisadvantagesOfPressOrSaySpeechUserInter.pdf.

Lewis, J. R., Commarford, P. M., Kennedy, P. J., & Sadowski, W. J. (2008). Handheld electronic devices. In C. Melody Carswell (Ed.), Reviews of Human Factors and Ergonomics, Vol. 4 (pp. 105–148). Santa Monica, CA: Human Factors and Ergonomics Society. Available at http://drjim.0catch.com/2008_HandheldElectronicDevices.pdf.

Lewis, J. R., Potosnak, K. M., & Magyar, R. L. (1997). Keys and keyboards. In M. Helander, T. K. Landauer, & P. Prabhu (Eds.), Handbook of Human-Computer Interaction (pp. 1285–1315). Amsterdam: Elsevier. Available at http://drjim.0catch.com/1997_KeysAndKeyboards.pdf.

McKellin, W. H., Shahin, K., Hodgson, M., Jamieson, J., & Pichora-Fuller, K. (2007). Pragmatics of conversation and communication in noisy settings. Journal of Pragmatics, 39, 2159–2184.

Rolandi, W. (2004a). Improving customer service with speech. Speech Technology, 9(5), 14.

Suhm, B. (2008). IVR usability engineering using guidelines and analyses of end-to-end calls. In D. Gardner-Bonneau & H. E. Blanchard (Eds.), Human factors and voice interactive systems, 2nd edition (pp. 1–41). New York, NY: Springer.