When barge-in is enabled, the system is always listening for the caller to speak and will stop speaking after detecting caller speech.

Types of barge-in

Older systems lacked barge-in capability and required specialized user interface design. Because all modern systems permit barge-in, we will not address those specialized design principles. For information about them, see Balentine et al. (1997) and Balentine & Morgan (2001).

There are two types of barge-in for VoiceXML-compliant recognizers: hotword and speech.
  • With speech barge-in, systems stop speaking as soon as they detect incoming speech or speech-like sounds.
  • With hotword barge-in, systems continue speaking until they detect that the caller has spoken a valid word or phrase from a currently active grammar.
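
The difference between the two types can be sketched as a decision rule. The Python below is only an illustrative model, not a recognizer API: `BargeInType`, `should_stop_prompt`, and the sample grammar are hypothetical names, and a real recognizer makes this decision from audio rather than from already-transcribed text.

```python
from enum import Enum
from typing import Optional


class BargeInType(Enum):
    SPEECH = "speech"
    HOTWORD = "hotword"


def should_stop_prompt(
    barge_in_type: BargeInType,
    speech_detected: bool,
    recognized_phrase: Optional[str],
    active_grammar: set[str],
) -> bool:
    """Return True if system speech should stop (hypothetical model)."""
    if barge_in_type is BargeInType.SPEECH:
        # Any speech or speech-like sound interrupts the prompt.
        return speech_detected
    # Hotword: only a valid phrase from the active grammar interrupts.
    return recognized_phrase is not None and recognized_phrase in active_grammar


grammar = {"continue", "main menu"}
# A cough is a speech-like sound but matches nothing in the grammar.
print(should_stop_prompt(BargeInType.SPEECH, True, None, grammar))         # True
print(should_stop_prompt(BargeInType.HOTWORD, True, None, grammar))        # False
print(should_stop_prompt(BargeInType.HOTWORD, True, "continue", grammar))  # True
```

In this model the same cough stops the prompt under speech barge-in but not under hotword barge-in, which is the trade-off the recommendations below turn on.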

Lombard speech and the "stuttering effect"

Lombard speech is the tendency of speakers in noisy environments to raise their voices or otherwise exaggerate their speech (Lombard, 1911).

If system speech continues playing for more than 300-500 ms after a caller begins speaking to barge in, this can trigger a "stuttering effect" (Balentine & Morgan, 2001). Callers think the system didn't hear them, so they stop what they were saying and start over, which often completely confuses the recognizer. This can lead to a series of usability issues as the caller and system get out of sync. The 500 ms figure is consistent with the amount of time it takes to resolve initiative conflicts in human-human conversation (Schegloff, 2000; Yang & Heeman, 2010).

Barge-in recommendations

In general, applications should have barge-in enabled
System speech should stop within 500 ms of the time a user begins barging in.

Failure to quickly stop system speech when callers barge in can cause serious usability problems, such as the "stuttering effect" (Balentine & Morgan, 2001; Schegloff, 2000; Yang & Heeman, 2010).
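
One way to check this guideline during testing is to compare, for each barge-in, the time the caller began speaking with the time the prompt actually stopped. The following is a minimal sketch under stated assumptions: the function names and the timestamps are hypothetical, and how those two timestamps are obtained depends entirely on the platform's logging.

```python
STOP_LATENCY_LIMIT_MS = 500  # guideline stated in the text


def stop_latency_ms(caller_speech_start_ms: int, prompt_stop_ms: int) -> int:
    """Time the prompt kept playing after the caller began speaking."""
    if prompt_stop_ms < caller_speech_start_ms:
        raise ValueError("prompt stopped before the caller spoke; not a barge-in")
    return prompt_stop_ms - caller_speech_start_ms


def meets_guideline(
    caller_speech_start_ms: int,
    prompt_stop_ms: int,
    limit_ms: int = STOP_LATENCY_LIMIT_MS,
) -> bool:
    """True if system speech stopped within the limit."""
    return stop_latency_ms(caller_speech_start_ms, prompt_stop_ms) <= limit_ms


# Made-up timestamps (ms from call start), for illustration only.
events = [(12_000, 12_180), (45_300, 46_050)]
for start, stop in events:
    print(stop_latency_ms(start, stop), meets_guideline(start, stop))
# prints "180 True" then "750 False"
```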

Potential exceptions to this guideline include disabling barge-in to:
  • Play messages to callers that they must hear for legal reasons
  • Prevent high ambient noise from constantly interrupting system speech. This is usually done in connection with switching from speech to touchtone-only (DTMF) input, while still allowing touchtone barge-in. See also Chapter 9.
  • Keep callers from interrupting prompts very early in a call. If early pilot tests show that side speech at the beginning of the call is causing turn-taking issues, you may want to disable barge-in at the very beginning so that callers are engaged and stop their side conversations before the recognizer picks them up.

For most applications, enable speech-based barge-in
The key advantage of speech-based barge-in is that it leads to interactions that are more like normal human-human dialogs, minimizing the effects of Lombard speech and the stuttering effect. Its primary disadvantage is its susceptibility to background noise and speech not intended for the system.

If using hotword barge-in, strive for very concise prompting and responding
The key advantage of hotword barge-in is its resistance to accidental interruption. Its primary disadvantage is its tendency to trigger the stuttering effect. This tendency can be overcome if users are trained in how to use the system effectively. If the design is for untrained callers in conditions of high ambient noise, one way to minimize the need to barge in is to keep the prompting and messaging very concise so the system speech finishes before a caller would have time to barge in. After all, even when a system allows barge-in, many callers prefer not to do so (Suhm, 2008). Another strategy is to promote short caller responses (trying to keep them to no more than 2-3 syllables). The typical time required to produce a syllable of speech is about 200 ms (Crystal & House, 1990; Massaro, 1975), so responses that are 2-3 syllables in length are less likely to trigger the stuttering effect. Also, provide plenty of pauses to give callers opportunities to begin speaking without actively interrupting the system.
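
The arithmetic behind the 2-3 syllable target can be made explicit. Assuming the roughly 200 ms per syllable cited above, a short response lasts about as long as the 300-500 ms window associated with the stuttering effect, so the caller has finished (or nearly finished) speaking before concluding that the system did not hear them. The sketch below is illustrative; `estimated_response_ms` is a hypothetical helper, and real speaking rates vary.

```python
SYLLABLE_MS = 200  # typical syllable duration cited in the text


def estimated_response_ms(syllables: int, syllable_ms: int = SYLLABLE_MS) -> int:
    """Rough duration of a spoken response, ignoring pauses and rate variation."""
    if syllables < 0:
        raise ValueError("syllable count cannot be negative")
    return syllables * syllable_ms


for n in (2, 3, 6):
    print(n, estimated_response_ms(n))
# prints "2 400", "3 600", "6 1200"
```

By this estimate a six-syllable response runs well past the window, leaving the caller more than half a second of talking over system speech.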

Use hotword barge-in for wait/continue situations
Sometimes a system will ask the caller for a piece of information they may not have readily available. The system will ask if they have it and, if they say no, respond with something along the lines of:
  • OK, I'll wait while you go get it. Just say "continue" when you're ready to go on.

For this type of situation, switch the barge-in settings to hotword detection so that any ambient noise while they're getting the needed information doesn't incorrectly restart the conversation.
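
In VoiceXML 2.0, barge-in behavior is controlled per prompt by the `bargein` and `bargeintype` attributes (also available as properties). The sketch below shows one way an application might generate the wait prompt with hotword barge-in. The `render_prompt` helper is hypothetical, and whether a given platform honors `bargeintype="hotword"` should be confirmed against its documentation.

```python
from xml.sax.saxutils import escape


def render_prompt(text: str, bargein: bool = True, bargeintype: str = "speech") -> str:
    """Build a VoiceXML <prompt> element string (hypothetical helper)."""
    if bargeintype not in ("speech", "hotword"):
        raise ValueError(f"unknown bargeintype: {bargeintype!r}")
    flag = "true" if bargein else "false"
    # escape() protects &, <, and > in the prompt text.
    return f'<prompt bargein="{flag}" bargeintype="{bargeintype}">{escape(text)}</prompt>'


wait_prompt = render_prompt(
    "OK, I'll wait while you go get it. "
    'Just say "continue" when you\'re ready to go on.',
    bargeintype="hotword",
)
print(wait_prompt)
```

The same helper covers the other cases in this section: the default call yields speech barge-in, and `bargein=False` yields a prompt that cannot be interrupted.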

If disabling barge-in, consider letting callers know and keep the messaging as concise as possible
If there is a need to play information that the caller must hear, then it will be necessary to disable barge-in. With barge-in disabled, it is critical to craft all messages to be as concise as possible, including any messaging used to let the caller know that they won't be able to interrupt the system. Depending on the exact reason for disabling and where it occurs in the call flow, notifying them may or may not be a good idea. If the message itself would cause more disruption than not playing it, leave it off.

References

Balentine, B., Ayer, C. M., Miller, C. L., & Scott, B. L. (1997). Debouncing the speech button: A sliding capture window device for synchronizing turn-taking. International Journal of Speech Technology, 2, 7–19.

Balentine, B., & Morgan, D. P. (2001). How to build a speech recognition application: A style guide for telephony dialogues, 2nd edition. San Ramon, CA: EIG Press.

Crystal, T. H., & House, A. S. (1990). Articulation rate and the duration of syllables and stress groups in connected speech. Journal of the Acoustical Society of America, 88, 101–112.

Lombard, E. (1911). Le signe de l’élévation de la voix. Annales des maladies de l’oreille et du larynx, 37, 101–199.

Massaro, D. (1975). Preperceptual images, processing time, and perceptual units in speech perception. In D. Massaro (Ed.), Understanding language: An information-processing analysis of speech perception, reading, and psycholinguistics (pp. 125–150). New York, NY: Academic Press.

Schegloff, E. A. (2000). Overlapping talk and the organization of turn-taking for conversation. Language in Society, 29, 1–63.

Suhm, B. (2008). IVR usability engineering using guidelines and analyses of end-to-end calls. In D. Gardner-Bonneau & H. E. Blanchard (Eds.), Human factors and voice interactive systems, 2nd edition (pp. 1–41). New York, NY: Springer.

Yang, F., & Heeman, P. A. (2010). Initiative conflicts in task-oriented dialogue. Computer Speech and Language, 24, 175–189.