A/B testing refers to experiments that test two alternative designs. It is also referred to as champion/challenger. Originally made popular in efforts to increase the effectiveness of direct mail marketing, the method has become very popular in web site assessment, especially for the purpose of comparing critical binomial outcome variables such as sales conversion rates.

The alternative designs can vary one design element (e.g., the wording of a prompt) while holding all else constant, or can vary multiple design elements simultaneously. Varying multiple design elements lets the designer make several changes at once that are expected to improve performance, but if performance does improve, it will not be clear which elements were responsible for the change or how much each contributed. It is possible to tease these effects apart with more complex multivariate designs.

There has been recent interest in applying A/B testing to interactive voice response systems (IVRs), again with a focus on critical binomial outcome variables such as self-service task success rates.

To set up an A/B test, it is necessary to have working versions of both designs, a mechanism for randomly directing incoming calls to one of the two designs, and a way to record the outcome of each call so that the better-performing design can be identified. The results of this type of experiment can be assessed statistically, either with a classical test of significance or by computing binomial confidence intervals (Sauro & Lewis, 2012).
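
The statistical assessment described above can be sketched in a few lines of Python. This is a minimal illustration, not necessarily the exact procedure recommended by Sauro & Lewis (2012): it assumes an adjusted-Wald interval as the binomial confidence interval and a pooled two-proportion z-test as the classical test of significance. The function names and the call counts in the usage example are hypothetical.

```python
import math
from statistics import NormalDist


def adjusted_wald_interval(successes, n, confidence=0.95):
    """Adjusted-Wald confidence interval for a binomial proportion."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Add z^2/2 successes and z^2/2 failures before computing the Wald interval.
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two success rates."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both designs succeeded (or failed) on every call: no evidence of a difference.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    # Made-up counts: design A completed 120 of 200 calls, design B 140 of 200.
    print("A:", adjusted_wald_interval(120, 200))
    print("B:", adjusted_wald_interval(140, 200))
    print("z, p:", two_proportion_z_test(120, 200, 140, 200))
```

If the two intervals overlap only slightly or the p-value is near the chosen threshold, collecting more calls before declaring a winner is the safer choice.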

It is usually wise to conduct the design performance comparison as early in the process as possible, because the later in the development life cycle a change is made, the more expensive it will be. Wizard of Oz (WoZ) testing is well suited to this situation because it can be employed long before any development work has begun (see Usability Testing).

Alternatively, many designers have successfully deployed two live versions of an application, each taking a percentage of traffic, and compared appropriate system and application metrics to reach a final verdict.
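
One way to split live traffic between two versions is sketched below. It is an assumption for illustration, not a description of any particular IVR platform: the call identifier, the salt, and the function name are hypothetical, and hashing is used so that the same identifier always lands on the same version while the overall split stays close to the requested percentage.

```python
import hashlib


def assign_version(call_id: str, share_b: float = 0.5, salt: str = "ab-test-1") -> str:
    """Map a call identifier to version "A" or "B".

    share_b is the fraction of traffic (0.0 to 1.0) that should reach version B.
    The salt is a hypothetical experiment name; changing it reshuffles assignments.
    """
    if not 0.0 <= share_b <= 1.0:
        raise ValueError("share_b must be between 0.0 and 1.0")
    digest = hashlib.sha256(f"{salt}:{call_id}".encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
    return "B" if bucket < share_b else "A"


if __name__ == "__main__":
    # Send roughly 10% of calls to the challenger design.
    for call_id in ("call-0001", "call-0002", "call-0003"):
        print(call_id, assign_version(call_id, share_b=0.10))
```

Starting the challenger on a small share of traffic limits the exposure if it performs worse; the share can be raised once early results look acceptable.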

It may be desirable to combine the A/B test with a post-call survey. A live survey with open-ended questions can collect qualitative as well as quantitative data, because it allows for participant comments and follow-up questions such as "What about X did you dislike?" Even an automated survey can collect metrics such as CSAT or NPS, which can then be correlated with the version of the design that the caller experienced.
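
As a rough sketch of relating survey results to the version each caller experienced, the following groups NPS ratings by version. It assumes the standard NPS definition (percentage of ratings of 9 or 10 minus percentage of ratings of 0 through 6); the record layout, function names, and sample ratings are hypothetical.

```python
from collections import defaultdict


def nps(scores):
    """Net Promoter Score for a list of 0-10 ratings, on a -100 to 100 scale."""
    if not scores:
        raise ValueError("at least one score is required")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)


def nps_by_version(responses):
    """Group (version, score) pairs by version and compute NPS for each group."""
    grouped = defaultdict(list)
    for version, score in responses:
        grouped[version].append(score)
    return {version: nps(scores) for version, scores in grouped.items()}


if __name__ == "__main__":
    # Made-up survey responses: (version the caller experienced, 0-10 rating).
    sample = [("A", 10), ("A", 3), ("A", 8), ("B", 9), ("B", 10), ("B", 7)]
    print(nps_by_version(sample))
```

Survey samples are usually much smaller than call volumes, so differences in per-version scores should be read with the same statistical caution as the task success rates.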

Further Reading
http://www.measuringusability.com/wald.htm

References

Sauro, J., & Lewis, J. R. (2012). Quantifying the user experience: Practical statistics for user research. Burlington, MA: Morgan Kaufmann.

Leppik, P. (2016). A/B testing with user feedback. SpeechTEK 2016.