Choosing voice talent

Work with a professional voice talent agency
Voice talent agencies have extensive experience producing IVR voice segments and, because they rely heavily on repeat business, are highly motivated to work with VUI designers and clients to select a high-quality voice talent consistent with enterprise branding (Graham, 2005, 2010). Their talents tend to stay with the agency for a long time, which is critical for ongoing changes: if the original talent is no longer available, you may have to re-record the entire system. Agencies also provide consistency between recording sessions.

Maintain consistency across the brand
If a voice talent already represents the brand in other media, consider using that talent for the IVR. However, this is generally not practical if current branding employs a celebrity voice. Celebrities are often expensive, not readily available, and generally inexperienced in producing voice segments for IVRs. Also consider whether the image the celebrity presents in conjunction with the company is consistent with the customer service the IVR will supply.

If a professional (non-celebrity) voice talent already does the branding in other media, there is a much stronger case for using the same talent in the IVR for consistency.

Give the client choices
If seeking a new voice talent, keep the client in the loop. It usually works well to provide clients with samples of three or four voices so they can choose the voice they feel best represents their company. Letting the stakeholders vote privately makes for a fun reveal and discussion of why voices were chosen.

Often there is no clear favorite. This is OK. Many voices can pull off any given design. Offer your stakeholders only choices that you would be happy with, whatever the outcome. Just as important as the voice itself is the talent's ability to respond to coaching and deliver the messages the way they were intended.

Consider gender
See the Gender section below for details. The bottom line is that the gender of the voice does not matter much. Ask the client if they have a preference. If you as the designer feel that something about the corporate culture lends itself to a male or a female voice, then make that recommendation.

Involve the right stakeholders in the decision
Make sure the highest-level executive who cares about the IVR voice is engaged in the selection process. "Trust me—you do not want to be in a meeting where you’re presenting the working version of the application (including all professional recordings) to the senior vice-president in charge of customer care who, upon hearing the voice for the first time, says, 'I hate it. We need a different voice'" (Lewis, 2011, p. 103).

Gender

Do not overemphasize gender
There is no compelling research to indicate an advantage based solely on the gender of the voice talent (Couper, Singer, & Tourangeau, 2004; Lewis, 2011). For average listeners in normal channels, “…there is little evidence to suggest that one sex of speaker is more intelligible than another, if other factors are ruled out. For example, males may typically have louder voices than females, and female voices may be more high-pitched than males, but if these factors are controlled for, any sex differences usually disappear” (Edworthy & Hellier, 2006).

There is a general tendency in the US to use a female voice for IVRs (likely due to their service-provider orientation; for a historical perspective, see Yellin, 2009), but there are numerous examples of successful use of male voices in IVRs. Find out if your client cares and, if so, take that into account when selecting a voice or set of voices to review.

There is no question that we all carry conscious and unconscious stereotypes in our heads. In recent years, the psychologist most strongly associated with research in how these stereotypes affect human-computer interaction is Clifford Nass (Nass & Brave, 2005; Nass & Yen, 2010; Reeves & Nass, 2003), most notably in the book, "Wired for Speech: How Voice Activates and Advances the Human-Computer Relationship". In that book, Nass and Brave (2005) described experiments in which different types of people used speech applications (notably, with most of the experiments using TTS rather than professional voice talents for their audio). In most of the studies they replicated classic social psychology studies of interactions between humans, replacing one of the humans with a speech-enabled computer, a variation of the “computers as social actors” (CASA) paradigm.

For example, they replicated the “similarity attraction” effect, the finding that people are attracted to other people who are similar to themselves. In these laboratory experiments, extroverts preferred an extroverted user interface and males preferred to hear a male voice. It turns out, however, that it is difficult to apply many of these findings to user interface design (e.g., how would you know in advance if a caller were male or female, introvert or extrovert). Additionally, they reported that people tend to rate male voices as more trustworthy (especially male listeners), and to expect females to be more nurturing.

Despite the reliability with which these social effects appear in replications of social psychology experiments, they are not as reliable when assessed in real-world systems that are otherwise usable, that is, efficient, effective, and pleasant (Balentine, 2007). Lewis (2011), in an analysis of data from studies of the perception of the quality of TTS voices (both male and female) rated by both males and females, did not find any significant Voice Gender by Listener Gender interaction, an interaction that the similarity attraction hypothesis would have predicted (the absence of this interaction was replicated by Machado et al., 2012). Couper, Singer, and Tourangeau (2004) studied the influence of male and female artificial voices on more than 1000 respondents to an IVR survey on sensitive topics. They measured respondents’ reactions to the different voices and abandoned call rates, and found no statistically significant results related to the gender of the voices. In particular, there were no significant Voice Gender by Respondent Gender interactions.

“Why such strong effects of humanizing cues are produced in laboratory studies but not in the field is an issue for further investigation. … Across these studies, little evidence is found to support the ‘computers as social actors’ thesis, at least insofar as it is operationalized in a survey setting” (Couper et al., 2004, p. 567).

Coaching, inflection

Use a coach during the recording session
Even though you have hired an agency to do the recordings with professional voice talents in a professional recording studio, it's unrealistic to expect the voice talent to understand all the nuances of expression when reading from a written recording manifest (the list of audio segments to record). You need to have someone present during the recording session(s) to coach the voice talent regarding context, emphasis, and appropriate tone. Most agencies let coaches phone into the session; if yours does not, you probably want to find another agency. Coaches will sometimes travel to the talent for a long session for the initial release of a system, but over-the-phone coaching is more the norm, especially for subsequent sessions. Not coaching is not an option.

The coach must be familiar with the design
The coach needs to be able to decide on the fly when it's appropriate and necessary to interrupt the voice talent. The coach needs to foster a relaxed environment and to listen attentively throughout the session. Coaches must be familiar with the target voice attributes, have a good ear for subtle voice differences, and be able to guide the voice talent without being offensive. The coach also has to be familiar with the system being recorded and the context of each prompt. In many cases the coach and the designer are one and the same because of that familiarity with the system: whoever wrote the prompts knows what they are supposed to sound like. However, not all designers make good coaches.

Add coaching notes to the design
Note in the recording manifest when it is important to emphasize a word or phrase. This can make all the difference between a prompt that guides the caller to say something the grammar can understand and one that misleads the caller into saying something out of grammar.

For any given sentence or phrase, there are many ways to speak it, only one or a few of which will be appropriate in a given context. For example, what is the correct way to record the question (appearing in a list of frequently asked questions), “What happens after I apply for cash assistance?” Should the speaker emphasize “What,” “happens,” “after,” “apply,” or “cash assistance”?

The answer depends on the question's context. If the surrounding items concern other aspects of applying for and getting cash assistance, then plan to emphasize “after,” contrasting it with the things that happen before applying. If the surrounding items have to do with other types of assistance such as food stamps or health benefits, then plan to emphasize “cash assistance.” It’s critical to get the prosodic element of contrastive stress correct (Cohen, Giangola, & Balogh, 2004; Lewis, 2011).

Usage notes are also helpful, especially when recording small pieces that will later be concatenated. Knowing that a segment will be an element in a list or the last thing in a fill-in-the-blank sentence makes a large difference in how it should be recorded.

These notes in the manifest are all the more important if the designer is not the coach. They will be invaluable to the coach and voice talent.
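As a rough illustration of how such notes might be carried in a manifest, the sketch below models one manifest entry with an emphasis note, a context note, and a usage note. The class name, field names, segment ID, and the capitals-for-stress convention are all hypothetical choices made for this example, not part of any standard manifest format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ManifestEntry:
    """One audio segment to record, plus notes for the coach and voice talent.

    All field names are illustrative assumptions, not a standard format.
    """
    segment_id: str                  # hypothetical segment/file name
    text: str                        # exact wording to record
    emphasis: Optional[str] = None   # word or phrase to stress, if any
    context: str = ""                # why that stress is right in this context
    usage: str = ""                  # e.g. "list item" or "end of sentence"

    def coaching_line(self) -> str:
        """Render the entry as a single line for a printed manifest."""
        text = self.text
        if self.emphasis:
            if self.emphasis not in text:
                raise ValueError(f"emphasis {self.emphasis!r} not found in text")
            # Mark the stressed phrase in capitals (first occurrence only).
            text = text.replace(self.emphasis, self.emphasis.upper(), 1)
        notes = "; ".join(n for n in (self.context, self.usage) if n)
        return f"{self.segment_id}: {text}" + (f"  [{notes}]" if notes else "")


# The FAQ example from the text, with contrastive stress on "after".
entry = ManifestEntry(
    segment_id="faq_03",
    text="What happens after I apply for cash assistance?",
    emphasis="after",
    context="surrounding FAQ items cover steps before applying",
)
print(entry.coaching_line())
```

If the surrounding items instead concerned other types of assistance, the same entry would set the emphasis to "cash assistance" and adjust the context note accordingly.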

References

Balentine, B. (2007). It’s better to be a good machine than a bad person. Annapolis, MD: ICMI Press.

Cohen, M. H., Giangola, J. P., & Balogh, J. (2004). Voice user interface design. Boston, MA: Addison-Wesley.

Couper, M. P., Singer, E., & Tourangeau, R. (2004). Does voice matter? An interactive voice response (IVR) experiment. Journal of Official Statistics, 20(3), 551–570.

Edworthy, J., & Hellier, E. (2006). Complex nonverbal auditory signals and speech warnings. In M. S. Wogalter (Ed.), Handbook of warnings (pp. 199–220). Mahwah, NJ: Lawrence Erlbaum.

Graham, G. M. (2005). Voice branding in America. Alpharetta, GA: Vivid Voices.

Graham, G. M. (2010). Speech recognition, the brand and the voice: How to choose a voice for your application. In W. Meisel (Ed.), Speech in the user interface: Lessons from experience (pp. 93–98). Victoria, Canada: TMA Associates.

Lewis, J. R. (2011). Practical speech user interface design. Boca Raton, FL: CRC Press, Taylor & Francis Group.

Machado, S., Duarte, E., Teles, J., Reis, L., & Rebelo, F. (2012). Selection of a voice for a speech signal for personalized warnings: The effect of speaker's gender and voice pitch. Work, 41, 3592–3598.

Nass, C., & Brave, S. (2005). Wired for speech: How voice activates and advances the human-computer relationship. Cambridge, MA: MIT Press.

Nass, C., & Yen, C. (2010). The man who lied to his laptop: What machines teach us about human relationships. New York, NY: Penguin Group.

Reeves, B., & Nass, C. (2003). The media equation: How people treat computers, television, and new media like real people and places. Chicago, IL: University of Chicago Press.

Yellin, E. (2009). Your call is (not that) important to us: Customer service and what it reveals about our world and our lives. New York, NY: Free Press.