Usability Testing (UT) is the process of watching and listening to "real" people use an application in likely or realistic scenarios. In contrast to Usability Assessments (inspection methods such as heuristic or expert evaluation), UT reduces reliance on subjective opinion (that of both experts and client management) by focusing on how people actually behave when interacting with an application. Usability Testing offers methodological controls that allow us to compare different groups of users or test competing design alternatives in a rigorous but feasible way.

Usability Testing of speech applications follows the same general philosophy and methods as UT for other applications, but some differences exist. One way in which UT of speech applications differs from UT of other applications is the inability to use the Think-Aloud (TA) methodology (also see Lewis, 2012). You can't ask participants to say what they're thinking as they use a speech recognition system, because their spoken commentary would interfere with the speech they direct at the system. Instead, the test facilitator can interview the participant immediately following each interaction to obtain their reactions.

The success of UT depends primarily on three factors: 1) how well the test participants represent the background knowledge, attitudes, and situations of the people who will use the live system; 2) how well the test scenarios simulate realistic situations and provide participants with believable reasons for making calls; and 3) the degree to which the system being used replicates the behavior of the production application.

Usability tests of a speech application may be conducted at many different stages throughout the design and development of the application. Testing with a fully functional application generally must occur after development is complete. It provides the most grounded, realistic data, but the results may arrive too late in the project to be maximally useful. Testing with a less functional, less realistic application can often happen earlier, but because users are interacting with a system that is not identical to the production application, the data are not as robust. One specific early UT method used for speech applications is known as "Wizard of Oz" (WOZ) testing, which can be conducted before the real system is completed. WOZ testing is particularly valuable when there are questions about how the target audience will interact (e.g., Sadowski & Lewis, 2001), but it has some limitations relative to testing with a working prototype or deployed system, notably weakness in detecting problems with recognition, audio quality, and turn-taking or other timing issues (Sadowski, 2001).

A typical usability test for a single user population requires two days of testing, with six participants per day and sessions lasting up to an hour each. This is not a hard-and-fast rule -- sessions may be longer or shorter as required, and may be distributed over more days, especially if there are multiple distinct user groups who must be included in the test. There are statistical methods for estimating and validating sample sizes for these types of formative usability studies -- for a review, see Chapter 7 of Sauro and Lewis (2012) or Lewis (2012, pp. 1292-1297).
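One widely used model for reasoning about formative sample sizes treats problem discovery as a binomial process: if a problem affects a proportion p of users, the chance of seeing it at least once among n participants is 1 - (1 - p)^n. The following Python code is a minimal sketch of that calculation, not a reproduction of any procedure from the cited sources. The function names are hypothetical, and the model assumes that p is the same for every participant and that participants are independent, which real tests only approximate.

```python
import math


def discovery_likelihood(p: float, n: int) -> float:
    """Chance that a problem occurring with probability p per participant
    is observed at least once in a test with n participants.

    Assumes independent participants and a constant p (a simplification).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


def participants_needed(p: float, goal: float) -> int:
    """Approximate smallest n for which discovery_likelihood(p, n) >= goal.

    Solves 1 - (1 - p)^n >= goal for n and rounds up; results that land
    exactly on an integer may be affected by floating-point rounding.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0.0 < goal < 1.0:
        raise ValueError("goal must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - goal) / math.log(1.0 - p))


if __name__ == "__main__":
    # Illustrative problem frequencies; these values are assumptions,
    # not figures taken from any study.
    for p in (0.10, 0.25, 0.50):
        likelihood = discovery_likelihood(p, 12)
        needed = participants_needed(p, 0.90)
        print(f"p={p:.2f}: 12 participants -> {likelihood:.2f}; "
              f"n for 90% discovery = {needed}")
```

Under these assumptions, the twelve participants of the typical two-day test give roughly a 97% chance of observing at least once a problem that affects a quarter of users. Rarer problems need larger samples, which is one reason tests with multiple distinct user groups tend to run longer.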

There are several costs involved.
  • Testing is often best conducted near the client company so that test participants are either existing customers of the company, or at least resemble such customers in characteristics such as regional accent, education level, and other demographic factors. Local testing may therefore involve travel expenses for the test staff.
  • Although it is possible to recruit and schedule test participants on your own, this can be very time-consuming, so many usability testers prefer to engage professional recruiters who can identify, screen, and schedule a set of test participants either from the general population according to a set of criteria to match the user base, or from a list of customers you provide.
  • In most cases, participants must be motivated to devote their time and to travel to the testing site, and the best motivation is cash or gift certificates.
  • Client personnel may be needed to help set up the test system, and we highly recommend that client personnel be observers of all testing so that they can verify the validity of the test procedures by direct observation.
  • Testing must be conducted in a quiet environment that is free of distractions for both the participants and the client observers. This may require the rental of office or office-like space.
  • Testing should be recorded, at least with audio, and preferably with video as well, both for later analysis and also to show client management who are not present as observers. There may be expenses involved for recording equipment.

It is also possible to conduct remote usability test sessions via conference call for WOZ testing. You can record these sessions for later analysis and review by using a phone tap, the built-in recording facilities of IP telephony (if available), or a video camera that records the audio from a speakerphone.

The basic deliverable from UT is a written list of specific recommendations based upon observations made during testing. Typically there will be recommendations for changes to the design of the application, and for tuning of recognition grammars. There may also be broader recommendations for changes to client procedures for serving customers so that the total customer experience of the client company is a positive and profitable one.

Secondary deliverables can include quantitative usability metrics such as task completion times, task completion rates, and satisfaction metrics (Sauro & Lewis, 2012). Chapter 7 of Sauro and Lewis (2012) provides comprehensive guidance on determining how many participants to evaluate in this type of formative usability test (also see Lewis, 2012). For a published example of a usability evaluation of a speech recognition IVR, see Lewis (2008).
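As an illustration of how such metrics might be summarized for the small samples typical of UT, the following Python code is a sketch that computes an adjusted-Wald confidence interval for a task completion rate and a geometric mean for task completion times. Both are commonly recommended small-sample summaries, but the function names and the sample data here are hypothetical, and the choice of a 95% confidence level is an assumption.

```python
import math
from statistics import NormalDist, geometric_mean


def adjusted_wald_interval(successes: int, trials: int,
                           confidence: float = 0.95) -> tuple[float, float]:
    """Adjusted-Wald confidence interval for a completion rate.

    Adds z^2/2 successes and z^2 trials before applying the usual
    Wald formula, which behaves better than the plain Wald interval
    when the sample is small.
    """
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    n_adj = trials + z ** 2
    p_adj = (successes + z ** 2 / 2.0) / n_adj
    margin = z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj)
    # Clamp to the valid range for a proportion.
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)


if __name__ == "__main__":
    # Hypothetical results: 9 of 12 participants completed the task.
    low, high = adjusted_wald_interval(successes=9, trials=12)
    print(f"Completion rate 9/12, 95% interval: {low:.2f} to {high:.2f}")

    # Hypothetical task times in seconds for the successful calls.
    times = [48.0, 52.5, 61.0, 75.0, 44.0, 130.0, 58.5, 66.0, 50.0]
    # Task times are usually right-skewed, so the geometric mean is
    # less affected by the one slow call than the arithmetic mean.
    print(f"Geometric mean task time: {geometric_mean(times):.1f} s")
```

The wide interval that results from a dozen participants is a useful reminder to report such metrics with their uncertainty rather than as bare percentages.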

References

Lewis, J. R. (2008). Usability evaluation of a speech recognition IVR. In T. Tullis & B. Albert (Eds.), Measuring the user experience, Chapter 10: Case studies (pp. 244–252). Amsterdam, Netherlands: Morgan Kaufmann.

Lewis, J. R. (2012). Usability testing. In G. Salvendy (Ed.), Handbook of Human Factors and Ergonomics, 4th ed. (pp. 1267–1312). New York, NY: John Wiley.

Sadowski, W. J. (2001). Capabilities and limitations of Wizard of Oz evaluations of speech user interfaces. In Proceedings of HCI International 2001: Usability evaluation and interface design (pp. 139–142). Mahwah, NJ: Lawrence Erlbaum.

Sadowski, W. J., & Lewis, J. R. (2001). Usability evaluation of the IBM WebSphere “WebVoice” demo (Tech. Rep. 29.3387, available at drjim.0catch.com/vxmllive1-ral.pdf). West Palm Beach, FL: IBM Corp.

Sauro, J., & Lewis, J. R. (2012). Quantifying the user experience: Practical statistics for user research. Burlington, MA: Morgan Kaufmann.