Automated speech recognition (ASR) systems have greatly improved in recent years as better algorithms and acoustic models are developed, and as more computing power can be brought to bear on the task. An ASR system running on an inexpensive home or office computer with a good microphone can take free-form dictation, as long as it has been pre-trained for the speaker's voice. Over the phone, and with no speaker training, a speech recognition system needs to be given a set of speech grammars that tell it what words and phrases it should expect. With these constraints a surprisingly large set of possible utterances can be recognized (e.g., a particular name out of thousands). Recognition over mobile phones in noisy environments does require more tightly pruned and carefully crafted speech grammars, however. Today there are many commercial uses of ASR in dozens of languages, and in areas as disparate as voice portals, finance, banking, telecommunications, and brokerages.
Advances are also being made in speech synthesis, or text-to-speech (TTS). Many of today's TTS systems still sound like "drunken robots", and can be hard to listen to or even at times incomprehensible. But waveform concatenation speech synthesis is now being deployed. In this technique, speech is not completely generated from scratch, but is assembled from libraries of pre-recorded waveforms. The results are promising.
It's important to note here that VoiceXML can be used without ASR or TTS: users can listen to recorded audio and press keys in response. Speech technology makes applications more powerful and pleasant to use, but VoiceXML brings the advantages of web development and deployment to older styles of computer telephony applications as well.
The future will bring more web devices: overnight delivery drop-off boxes that schedule pickups and record their contents, networked portable MP3 players, vending machines that reorder supplies when running low, wall displays that download artwork, web-based stereo receivers and televisions, and many others.
Speech technology, as it improves, will become a very natural and powerful interface for these ubiquitous web devices. Microphones are much smaller than keyboards and keypads; speakers are smaller than screens. So it seems quite likely that many future web devices will have on-board speech recognition (as do some mobile phones today), or perhaps that we'll carry voice-activated universal remotes to talk to the devices in our immediate surroundings.
Source: VoiceXML Forum
With VoiceXML you can connect to the Internet using a phone instead of a web browser. To do this you call a server that runs a voice browser; this is your voice portal to the Internet.
Figure 1: Static VoiceXML Application
Dynamic VoiceXML applications function in much the same way, except that some or all of the pages are generated dynamically by a server-based technology such as JSP/JavaBeans, PHP, or ColdFusion, or by a scripting language such as Perl. An overview of this model is shown below.
Figure 2: Dynamic VoiceXML Application
In this way the voice application and other screen-based web content such as HTML can share the same delivery platform.
Here is a VoiceXML page fragment that demonstrates a classical DTMF-based IVR menu:
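A minimal sketch of such a menu, assuming VoiceXML 2.0 syntax. The prompt wording and the example.com target URLs are hypothetical placeholders, not taken from any real service:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <menu id="main">
    <!-- Accept key presses only, so no speech recognizer is required. -->
    <property name="inputmodes" value="dtmf"/>
    <prompt>
      For sales, press 1. For support, press 2. For billing, press 3.
    </prompt>
    <!-- Each choice maps one key to the next page to fetch (hypothetical URLs). -->
    <choice dtmf="1" next="http://www.example.com/sales.vxml"/>
    <choice dtmf="2" next="http://www.example.com/support.vxml"/>
    <choice dtmf="3" next="http://www.example.com/billing.vxml"/>
    <!-- Replay the prompt if the caller presses nothing or an unlisted key. -->
    <noinput>I did not detect a key press. <reprompt/></noinput>
    <nomatch>That is not a valid option. <reprompt/></nomatch>
  </menu>
</vxml>
```

When the caller presses a listed key, the voice browser fetches the page named in that choice's next attribute, much as a web browser follows a link.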
The phone is important. Phones are everywhere in the developed world, and there are far more of them than Internet-connected computers. Mobile phones are achieving high penetration rates too: they are small, light, inexpensive, and have a long battery life, making them far more portable than computers. They can be used for applications that aren't feasible on computers, such as location-based services. Phones don't have to be booted up, and can be used while driving (though not always safely).
Voice is also important on the phone. For instance, WAP is a useful technology, but WAP screens are small and can be restrictive, and keypad input can be difficult. WAP is far harder than voice to use while driving, and it is available only in a tiny percentage of phones and geographic regions. The i-mode system is more compelling, but shares many of these limitations. However, graphics are still important, and before long we'll see "multi-modal" devices which run applications that are voice-only, or graphics-only, or a mix of voice and graphics ("say the name of the city, or select it from the list"). The right modality can then be used for each task. One very promising approach is simply to augment VoiceXML with graphical prompt-and-collect capabilities.
The Internet is important to voice applications:
· Voice application development is easier because VoiceXML is a high-level, domain-specific markup language, and because voice applications can now be constructed with plentiful, inexpensive, and powerful web application development tools.
· Applications are easy to deliver. No longer must they reside on a special-purpose voice server in a proprietary "walled garden": they can be placed anywhere on the Internet. The phone can be opened up to searches, third party applications, bookmarks and other web browsing features.
· Applications can be cleanly structured into service logic on the web server and presentation logic in VoiceXML pages delivered to the voice browser. This has many advantages, not the least of which is that a common application back end on the web server can serve up different types of presentation logic based on the user's device. This factoring can lead to huge savings.
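The split between presentation logic and service logic can be sketched as follows, assuming VoiceXML 2.0. The page only prompts for and collects an account number, then hands it to the web server. The form and field names and the example.com URL are hypothetical, and the built-in digits field type is assumed to be supported by the voice platform:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Presentation logic: prompt the caller and collect the input. -->
  <form id="balance">
    <field name="account" type="digits">
      <prompt>Please say or key in your account number.</prompt>
      <nomatch>Sorry, I did not understand. <reprompt/></nomatch>
    </field>
    <block>
      <!-- Service logic stays on the web server (hypothetical URL).
           The server is expected to answer with the next VoiceXML page. -->
      <submit next="http://www.example.com/balance"
              namelist="account" method="post"/>
    </block>
  </form>
</vxml>
```

Because the server receives an ordinary HTTP request, the same back end could in principle answer an HTML form with the same field and return a different presentation for each kind of device.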
And finally, voice, and therefore VoiceXML, will be important for web devices other than the phone. For example, if the voice-actuated "universal remote" should become reality, it could have an on-board voice browser and maintain a VoiceXML menu page generated from the names and URLs of all the devices in range. When activated, the remote would interpret this VoiceXML menu, then go off and ask the named device for its top-level menu of options, and so on.
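The menu page such a remote might generate could look roughly like this, assuming VoiceXML 2.0. The device names and example.com URLs are invented for illustration; a real remote would fill them in from whatever devices it discovered:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <menu id="devices">
    <!-- enumerate reads out the choice names listed below. -->
    <prompt>Which device would you like? <enumerate/></prompt>
    <!-- One choice per device in range: spoken name plus the URL of
         that device's top-level menu (all hypothetical). -->
    <choice next="http://stereo.example.com/menu.vxml">stereo</choice>
    <choice next="http://tv.example.com/menu.vxml">television</choice>
    <choice next="http://thermostat.example.com/menu.vxml">thermostat</choice>
    <noinput>Please say the name of a device. <reprompt/></noinput>
    <nomatch>I did not find that device. <reprompt/></nomatch>
  </menu>
</vxml>
```

Saying a device name causes the voice browser to fetch that device's own menu page, which corresponds to the "ask the named device for its top-level menu" step described above.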
Source: VoiceXML Forum
What sorts of applications are best suited to voice systems? Here are a few ideas.
Corporate services including