Recognition of unconstrained alphanumeric input in English is notoriously poor because many English letter names are phonologically similar (e.g., Lewis, 2011; Rolandi, 2006; Wright et al., 2002). For example, "B", "C", "D", "E", "G", "P", "Z", and "3" sound alike and are easily confused with one another.

There are two basic strategies to improve alphanumeric recognition.

First possibility: constrain the grammar itself so that it will only return results that are valid
This can typically be done in two ways.
  • If the input you are collecting follows a pattern (for example, the first character of every item is always an A), you can build that pattern into the grammar, so that every result the grammar returns has an A in the first position.
  • Or, if your data consists of a known universe of items (for example, only these particular 20,000 VINs are valid), you can build a large grammar (possibly dynamically) that contains only the items that are actual valid IDs.
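
The known-universe approach can be sketched as follows. This is a minimal illustration, not a production grammar builder: it assumes the recognizer accepts W3C SRGS XML grammars and that each ID is spoken one character at a time. The function name build_id_grammar and the rule name valid_id are hypothetical, and how a given platform pronounces tokens such as "A" or "1" varies by recognizer.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"


def build_id_grammar(valid_ids, lang="en-US"):
    """Return an SRGS XML grammar (as a string) that accepts only valid_ids.

    Hypothetical helper: assumes callers spell each ID character by character.
    """
    grammar = ET.Element("grammar", {
        "xmlns": SRGS_NS,
        "version": "1.0",
        "xml:lang": lang,
        "root": "valid_id",
    })
    rule = ET.SubElement(grammar, "rule", {"id": "valid_id", "scope": "public"})
    one_of = ET.SubElement(rule, "one-of")
    # One alternative per distinct ID; duplicates are dropped.
    for item_id in sorted(set(valid_ids)):
        item = ET.SubElement(one_of, "item")
        # "A12" becomes the token sequence "A 1 2".
        item.text = " ".join(item_id)
    return ET.tostring(grammar, encoding="unicode")


if __name__ == "__main__":
    print(build_id_grammar(["A12", "A34"]))
```

Because the grammar is generated from data, it can be rebuilt whenever the list of valid IDs changes, which is what "dynamically" would mean in practice.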

Notice that both of these techniques cause the recognizer to return only valid items, regardless of what the caller actually says. This can be a security concern, because a caller who says an invalid ID may still be matched to a valid one.

For purely numeric strings, especially credit card numbers, a checksum is a valuable tool for detecting misrecognized strings. See credit card for more information on how to use it.
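
As an illustration of checksum validation, credit card numbers carry a Luhn check digit. The sketch below shows a Luhn check and one way it might be used to pick the first plausible hypothesis from a recognizer's results. The function names and the sample numbers are illustrative only.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def first_checksum_match(hypotheses):
    """Return the first hypothesis that passes the checksum, or None."""
    for hyp in hypotheses:
        if luhn_valid(hyp):
            return hyp
    return None


if __name__ == "__main__":
    # Hypothetical recognition results; the second is a well-known test number.
    results = ["4111111111111112", "4111111111111111"]
    print(first_checksum_match(results))
```

A checksum only shows that a string is well formed, not that the account exists, so it complements a backend lookup rather than replacing it.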

Second possibility: validate the input after recognition
This could be done:

Which method you choose (or combination of methods) depends on the nature of what is being recognized (the patterns in the number, the data itself) as well as any security concerns around what is being recognized.

In addition to using the actual n-best list, you can use proactive substitution when you check utterances against a backend. This involves taking commonly misrecognized characters and replacing them with the characters they are most often confused with. For example, if the recognizer returns "WJS", you might also look up "WJF", even if it does not appear anywhere in the n-best list.
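
Proactive substitution can be sketched as below. The confusion table here is a made-up example; in practice it would come from tuning data for your own recognizer. The names CONFUSABLE, substitution_candidates, and lookup are hypothetical, and the backend is stood in for by a plain set of valid IDs.

```python
# Hypothetical confusion pairs; replace with data from your own tuning logs.
CONFUSABLE = {
    "S": ("F",), "F": ("S",),
    "B": ("D", "P"), "D": ("B",), "P": ("B",),
    "M": ("N",), "N": ("M",),
}


def substitution_candidates(hypothesis, confusable=CONFUSABLE):
    """Yield the hypothesis, then each variant with one character swapped."""
    yield hypothesis
    for i, ch in enumerate(hypothesis):
        for alt in confusable.get(ch, ()):
            yield hypothesis[:i] + alt + hypothesis[i + 1:]


def lookup(n_best, valid_ids):
    """Return the first valid ID found, preferring unmodified hypotheses."""
    # First pass: the n-best list exactly as the recognizer returned it.
    for hyp in n_best:
        if hyp in valid_ids:
            return hyp
    # Second pass: single-character substitutions of each hypothesis.
    for hyp in n_best:
        for cand in substitution_candidates(hyp):
            if cand in valid_ids:
                return cand
    return None


if __name__ == "__main__":
    print(lookup(["WJS"], {"WJF", "ABC"}))
```

This sketch returns the first match it finds. If more than one substituted candidate could be valid, it is probably safer to confirm the result with the caller than to accept it silently.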

For long strings, consider breaking the input into chunks
The longer an alphanumeric string is, the more likely it is to contain an error somewhere, and a single error makes the entire string wrong. For example, suppose you have a recognition system with an average character accuracy of 97%, so individual errors happen only 3 times in 100 (on average). If errors are independent, the likelihood that an entire string is correct is p^n, where p is the average character accuracy and n is the number of characters. For 4 characters, that is about 89%. For 16 characters (like a 16-digit credit card number), the full-string accuracy drops to about 61%. For this reason, if it makes contextual sense, consider breaking long strings up into chunks. On the other hand, many modern recognizers (especially tuned digit recognizers) have very high recognition accuracies, so it's OK to start by prompting for the full string, especially if you can compare the n-best list to a database or set of business rules to weed out obviously incorrect strings (see Using n-best Lists). For a piecewise grammar and confirmation strategy to use when parts of a string have high recognition accuracy but another part does not, see Parkinson (2012).
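
The arithmetic above can be checked with a short sketch. It assumes that character errors are independent, which is a simplification, and the function name string_accuracy is just illustrative.

```python
def string_accuracy(char_accuracy: float, length: int) -> float:
    """Probability that every character in a string is recognized correctly,
    assuming independent errors with the same per-character accuracy."""
    return char_accuracy ** length


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(f"{n:2d} characters: {string_accuracy(0.97, n):.0%}")
```

Running the same function with a higher per-character accuracy shows why a well-tuned digit recognizer may handle a full 16-digit string acceptably without chunking.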

For Spanish
Note that in Spanish, the letter names for B and V are especially ambiguous in some dialects, so craft prompts accordingly.

References

Lewis, J. R. (2011). Practical speech user interface design. Boca Raton, FL: CRC Press, Taylor & Francis Group.

Parkinson, F. (2012). Alphanumeric Confirmation and User Data. Presentation at SpeechTek 2012, available at http://www.speechtek.com/2012/Presentations.aspx (search for Parkinson in Session B102).

Rolandi, W. (2006). The alpha bail. Speech Technology, 11(1), 56.

Wright, L. E., Hartley, M. W., & Lewis, J. R. (2002). Conditional probabilities for IBM Voice Browser 2.0 alpha and alphanumeric recognition (Tech. Rep. 29.3498). West Palm Beach, FL: IBM. Retrieved from http://drjim.0catch.com/alpha2-acc.pdf