Spoken Dialogue vs. Touch-Tone (DTMF) Based Interactive Voice Response Services

Touch-tone based Interactive Voice Response services are already familiar, but they can be frustrating to use and many people find them inadequate. A major drawback is that callers have to memorise an artificial mapping between numbers on a keypad and various actions (e.g. "Press 1 for current account balance, press 2 for savings account balance", etc.). To avoid overwhelming callers with options, the main menus use high-level descriptions to represent the myriad functions the service is trying to perform, which makes choosing an option difficult.

With speech recognition systems that recognise individual keywords, callers do not need to be "taught" the key-to-action map. Instead, the caller is instructed to use one of several spoken keywords (e.g. "Please say: Balance, Savings, Help or Operator, now…"). The disadvantage is that the keywords still have to be listed, and callers still have to work through a menu hierarchy.

With Spoken Natural Dialogue, however, the caller can state a request without listening to a list of commands, and can navigate straight to a specific command (e.g. "I want to know my account balance"), using natural-sounding dialogue. Spoken Natural Dialogue systems are considerably more complex to design and implement than touch-tone or command-word systems, and one might expect such a system to be hopelessly error prone because of word recognition errors. However, overall dialogue accuracy does not depend on recognising every word correctly, because the phrase content is recognised in the context of the process being accessed in the application. For example, given "I want to know my bank balance" with a word accuracy of 70%, the phrase recognition accuracy would still be high, because the voice browser would be expecting words like "bank" and "balance", these being keywords in the application context.
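The "bank balance" example above can be sketched in a few lines of Python. This is only an illustration of keyword spotting in an application context, not how any particular voice browser works: the keyword map, the action names and the simulated recogniser output are all hypothetical. Two of the seven words are deliberately misrecognised (about 71% word accuracy), yet the request is still interpreted correctly.

```python
# Hypothetical keyword-to-action map for a banking application.
# The action names are invented for illustration only.
KEYWORD_ACTIONS = {
    "balance": "read_balance",
    "savings": "read_savings_balance",
    "operator": "transfer_to_operator",
    "help": "play_help",
}


def word_accuracy(reference, hypothesis):
    """Fraction of positions where the recognised word matches the spoken word.

    Assumes both lists have the same length (substitution errors only).
    """
    matches = sum(1 for r, h in zip(reference, hypothesis) if r == h)
    return matches / len(reference)


def interpret(recognised_words):
    """Return the action for the first expected keyword found, or None."""
    for word in recognised_words:
        action = KEYWORD_ACTIONS.get(word.lower())
        if action is not None:
            return action
    return None


# What the caller said, and a simulated (assumed) recogniser output
# in which "I" and "know" have been misrecognised.
spoken = ["i", "want", "to", "know", "my", "bank", "balance"]
recognised = ["eye", "want", "to", "no", "my", "bank", "balance"]

accuracy = word_accuracy(spoken, recognised)
action = interpret(recognised)
```

Because the application only needs the keyword "balance" to survive recognition, the errors in the other words do not affect the outcome. A real system would also have to handle the case where no keyword, or more than one, is found.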


How Human should it be?

An important interface consideration is the vocal quality used in the interaction with the caller. Human beings respond in different ways depending on the "friendliness" of the voice that is communicating with them. Pre-recorded messages of human speech can be used in applications where the text is static (unchanging), but this can be very limiting, especially when the data to be spoken is not known in advance. Computer-synthesised text-to-speech is easier to maintain and modify, and is a powerful tool for reading aloud previously unknown or dynamically changing text. However, it cannot yet mimic the complete naturalness of human speech and can sound so robotic that the caller becomes bored or even annoyed, loses focus or simply will not listen, which is not very user friendly.

Fortunately, great advances are being made in this field. But if the voice sounds too human, the caller may be fooled into thinking that the service has a greater ability than it actually has, and will not speak so clearly to the machine, which would then increase dialogue recognition errors. This could also increase the caller's frustration and create even more recognition errors through raised speech volume and emotional vocal distortions. The caller needs to realise that they are communicating with a machine.

Companies such as AT&T have done a lot of research on the subject. Using a study called "How may I help you?", AT&T discovered that people responded best to an "audio logo" played at the beginning of the call greeting, as a way of letting callers know that they are talking to a machine, together with the use of a "more friendly voice" to communicate the service. The caller could also be given a choice of vocal style to suit. Studies have also found that voice gender, though irrelevant to computers, had an effect on the caller: people responded better to a "male" voice when being instructed on a typically "male" topic, and to a "female" voice on a typically "female" topic.

Speech Recognition Accuracy

Speech recognition accuracy in the spoken dialogue is achieved by building the voice application from a series of speech recognition 'grammars', which define the words and phrases that the caller can speak, and by specifying where in the application each grammar should be active. It is important to ask the right questions, using unambiguous prompts. Other factors that affect speech recognition accuracy include: audio input quality, which depends on the particular headset microphone or telephone used; the speaking environment (which could be a noisy, crowded room, and people speak differently in noisy conditions to make themselves understood); and the caller's voice characteristics, such as accent and fluency (timing and pausing).
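The idea of grammars that are active only at particular points in the application can be sketched as follows. This is a simplified illustration, not a real grammar format: the state names, the word lists and the match_utterance function are hypothetical, and each grammar is reduced to a plain set of allowed words.

```python
import re

# Hypothetical grammars: each dialogue state has its own set of
# words the caller is expected to say at that point.
GRAMMARS = {
    "main_menu": {"balance", "savings", "help", "operator"},
    "confirm": {"yes", "no"},
}


def match_utterance(state, utterance):
    """Match an utterance against the grammar active in the given state.

    Returns the single matching word, or None when nothing matches or
    when the utterance is ambiguous (more than one grammar word found),
    in which case the application would normally re-prompt the caller.
    """
    active_grammar = GRAMMARS[state]
    words = re.findall(r"[a-z']+", utterance.lower())
    hits = [w for w in words if w in active_grammar]
    if len(hits) == 1:
        return hits[0]
    return None


menu_choice = match_utterance("main_menu", "I want to know my balance")
confirmation = match_utterance("confirm", "Yes please")
out_of_grammar = match_utterance("confirm", "balance")
ambiguous = match_utterance("main_menu", "savings balance")
```

Restricting the active grammar to what the current prompt asks for is one reason unambiguous prompts matter: the word "balance" is accepted at the main menu but rejected at a yes/no confirmation, and an utterance containing two menu words is treated as ambiguous rather than guessed at.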

