Phone the Web

Overview | Phone the Web resource library | People

Mobile Access

Voice Technology

Automated speech recognition (ASR) systems have greatly improved in recent years as better algorithms and acoustic models are developed, and as more computer power can be brought to bear on the task. An ASR system running on an inexpensive home or office computer with a good microphone can take free-form dictation, as long as it has been pre-trained for the speaker's voice. Over the phone, and with no speaker training, a speech recognition system needs to be given a set of speech grammars that tell it what words and phrases it should expect. With these constraints a surprisingly large set of possible utterances can be recognized (e.g., a particular name out of thousands). Recognition over mobile phones in noisy environments does require more tightly pruned and carefully crafted speech grammars, however. Today there are many commercial uses of ASR in dozens of languages, and in areas as disparate as voice portals, finance, banking, telecommunications, and brokerages.

Advances are also being made in speech synthesis, or text-to-speech (TTS). Many of today's TTS systems still sound like "drunken robots", and can be hard to listen to or even at times incomprehensible. But waveform concatenation speech synthesis is now being deployed. In this technique, speech is not completely generated from scratch, but is assembled from libraries of pre-recorded waveforms. The results are promising.

It's important to note here that VoiceXML can be used without ASR or TTS: users can listen to recorded audio and press keys in response. Speech technology makes applications more powerful and pleasant to use, but VoiceXML brings the advantages of web development and deployment to older styles of computer telephony applications as well.

The future will bring more web devices: overnight delivery drop off boxes that schedule pickups and record their contents, networked MP3 portables, vending machines that reorder supplies when running low, wall displays that download artwork, web-based stereo receivers and televisions, and many others.

Speech technology, as it improves, will become a very natural and powerful interface for these ubiquitous web devices. Microphones are much smaller than keyboards and keypads; speakers are smaller than screens. So it seems quite likely that many future web devices will have on-board speech recognition (as do some mobile phones today), or perhaps that we'll carry voice-activated universal remotes to talk to the devices in our immediate surroundings.

Source: VoiceXML Forum


With VoiceXML you can connect to the Internet using a phone instead of a web browser. To do this, you call a server that runs a ‘voice browser’; this is your voice portal to the Internet.

VoiceXML pages are stored on a web server and delivered to the voice browser in the same way as HTML is delivered to a web browser. Transitions are controlled by the <goto> tag (which is executed in VoiceXML, as compared with being 'clicked' by the user in a visual web browser).

Figure 1 : Static VoiceXML Application

Dynamic VoiceXML applications function in much the same way, except that some or all of the pages are generated dynamically by a server-side technology such as JSP/JavaBeans, PHP, or ColdFusion, or by a scripting language such as Perl. An overview of this model is shown below.

Figure 2 : Dynamic VoiceXML Application

In this way the voice application and other ‘screen based’ web data such as HTML can share the same delivery platform.
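To make the dynamic model more concrete, here is a minimal sketch in Python, standing in for whichever server-side technology a real site might use. The function name `render_weather_page`, the city, the forecast, and the `menu.vxml` URL are all hypothetical; only the element names (`vxml`, `form`, `block`, `goto`) come from VoiceXML itself, and version 2.0 is an assumption.

```python
import xml.etree.ElementTree as ET


def render_weather_page(city: str, forecast: str, menu_url: str) -> str:
    """Build a VoiceXML page for one forecast (hypothetical example data)."""
    # A production VoiceXML 2.0 page would also declare the VoiceXML
    # namespace on the root element; it is left out here for brevity.
    vxml = ET.Element("vxml", version="2.0")
    form = ET.SubElement(vxml, "form")
    block = ET.SubElement(form, "block")
    # Text placed directly in a block is spoken to the caller.
    block.text = f"The weather in {city} is {forecast}."
    # After speaking, the voice browser follows the goto without user input.
    ET.SubElement(block, "goto", next=menu_url)
    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    print(render_weather_page("Milton Keynes", "cloudy", "menu.vxml"))
```

A web server would return this string in response to the voice browser's HTTP request, much as it would return generated HTML to a visual browser.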

What does a VoiceXML application look like?

Here is a VoiceXML page fragment that demonstrates a classical DTMF-based IVR menu:


    <menu>
      <property name="inputmodes" value="dtmf"/>
      <prompt>
        For sports press 1, For weather press 2, For stock quotes press 3.
      </prompt>
      <choice dtmf="1" next="http://www.sports.example/vxml/start.vxml"/>
      <choice dtmf="2" next=""/>
      <choice dtmf="3" next="http://www.stockquotes.example/voice/stock.vxml"/>
    </menu>
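The following Python sketch illustrates, in a simplified way, what a voice browser does with such a menu: it matches the key the caller pressed against the `dtmf` attribute of each choice and moves to the corresponding `next` URL. The function `next_url_for_key` is hypothetical and is not part of any real voice browser, and the weather URL is a placeholder because the fragment above leaves that target blank.

```python
import xml.etree.ElementTree as ET
from typing import Optional

MENU = """
<menu>
  <property name="inputmodes" value="dtmf"/>
  <prompt>For sports press 1, For weather press 2, For stock quotes press 3.</prompt>
  <choice dtmf="1" next="http://www.sports.example/vxml/start.vxml"/>
  <!-- Placeholder URL: the original fragment leaves this target blank. -->
  <choice dtmf="2" next="http://www.weather.example/placeholder.vxml"/>
  <choice dtmf="3" next="http://www.stockquotes.example/voice/stock.vxml"/>
</menu>
"""


def next_url_for_key(menu_xml: str, key: str) -> Optional[str]:
    """Return the URL of the choice whose dtmf attribute matches the key."""
    menu = ET.fromstring(menu_xml)
    for choice in menu.iter("choice"):
        if choice.get("dtmf") == key:
            return choice.get("next")
    # No choice matched; a real voice browser would throw a nomatch event.
    return None


if __name__ == "__main__":
    print(next_url_for_key(MENU, "3"))
```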


Bringing It All Together

The phone is important. Phones are everywhere in the developed world, and there are far more of them than Internet-connected computers. Mobile phones are achieving large penetration rates too: they are small, light, inexpensive, and have a long battery life, making them far more portable than computers. They can be used for applications that aren't feasible on computers, such as location-based services. Phones don't have to be booted up, and can be used while driving (though not always safely).

Voice is also important on the phone. For instance, WAP is a useful technology, but WAP screens are small and can be restrictive, and keypad input can be difficult. WAP is far harder than voice to use while driving, and it is available only in a tiny percentage of phones and geographic regions. The i-mode system is more compelling, but shares many of these limitations. However, graphics are still important, and before long we'll see "multi-modal" devices which run applications that are voice-only, or graphics-only, or a mix of voice and graphics ("say the name of the city, or select it from the list"). The right modality can then be used for each task. One very promising approach is just to augment VoiceXML with graphical prompt and collect capabilities.

The Internet is important to voice applications:

· Voice application development is easier because VoiceXML is a high-level, domain-specific markup language, and because voice applications can now be constructed with plentiful, inexpensive, and powerful web application development tools.

· Applications are easy to deliver. No longer must they reside on a special-purpose voice server in a proprietary "walled garden": they can be placed anywhere on the Internet. The phone can be opened up to searches, third party applications, bookmarks and other web browsing features.

· Applications can be cleanly structured into service logic, which lives on the web server, and presentation logic, which lives in the VoiceXML pages delivered to the voice browser. This has many advantages, not the least of which is that a common application back end on the web server can serve up different types of presentation logic based on the user's device. This factoring can lead to huge savings.
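The last point, a shared back end with device-specific presentation, can be sketched as follows. This is an illustration under stated assumptions rather than a prescribed design: the function names, the `"phone"` device label, and the hard-coded ACME quote are all hypothetical.

```python
from xml.sax.saxutils import escape


def get_quote(symbol: str) -> dict:
    """Service logic: look up a quote (hard-coded, hypothetical data)."""
    prices = {"ACME": 12.5}
    return {"symbol": symbol, "price": prices[symbol]}


def as_html(quote: dict) -> str:
    """Presentation logic for a visual web browser."""
    symbol = escape(quote["symbol"])
    return f"<html><body><p>{symbol}: {quote['price']:.2f}</p></body></html>"


def as_voicexml(quote: dict) -> str:
    """Presentation logic for a voice browser."""
    symbol = escape(quote["symbol"])
    return (
        '<vxml version="2.0"><form><block>'
        f"{symbol} is trading at {quote['price']:.2f}."
        "</block></form></vxml>"
    )


def render(symbol: str, device: str) -> str:
    """Run the shared service logic once, then pick a presentation."""
    quote = get_quote(symbol)
    renderer = as_voicexml if device == "phone" else as_html
    return renderer(quote)


if __name__ == "__main__":
    print(render("ACME", "phone"))
    print(render("ACME", "desktop"))
```

Only the two small rendering functions differ between devices; the lookup in `get_quote` is written once, which is where the savings described above would come from.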

And finally, voice, and therefore VoiceXML, will be important for web devices other than the phone. For example, if the voice-actuated "universal remote" should become reality, it could have an on-board voice browser and maintain a VoiceXML menu page generated from the names and URLs of all the devices in range. When activated, the remote would interpret this VoiceXML menu, then go off and ask the named device for its top-level menu of options, and so on.
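As a rough sketch of how such a remote might generate its menu page, the Python function below turns a list of device names and URLs into a VoiceXML menu. The function name `build_device_menu`, the device names, and their URLs are hypothetical; how a remote would actually discover devices in range is outside the scope of the example.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple


def build_device_menu(devices: Iterable[Tuple[str, str]]) -> str:
    """Build a VoiceXML menu with one spoken choice per (name, url) pair."""
    vxml = ET.Element("vxml", version="2.0")
    menu = ET.SubElement(vxml, "menu")
    prompt = ET.SubElement(menu, "prompt")
    prompt.text = "Say one of: "
    # The enumerate element reads out the menu's choices in order.
    ET.SubElement(prompt, "enumerate")
    for name, url in devices:
        choice = ET.SubElement(menu, "choice", next=url)
        # The text of a choice is the phrase the user says to select it.
        choice.text = name
    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    in_range = [
        ("stereo", "http://stereo.example/menu.vxml"),
        ("television", "http://tv.example/menu.vxml"),
    ]
    print(build_device_menu(in_range))
```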

Source: VoiceXML Forum


What sorts of applications are best suited to voice systems? Here are a few ideas.

Information retrieval
News, sports, traffic, weather, and stock information, specialized information (e.g., intranet-based company news).

Electronic commerce
Catalog ordering applications (CDs or videos; groceries; office supplies; concert or game tickets). Customer service applications (package tracking, account status, and call centers). Financial applications (banking, stock quotes, and trading).

Corporate services, including:

Telephone services
Personal voice dialing, one-number "find-me" services, and teleconference setup and management. An organization can upload a voice web site to its voice service provider with information, news, upcoming events, and an address book. The address book could be used for voice dialing people in that organization.

Intranet applications
Inventory control, ordering supplies, providing human resource services, corporate portals.

Unified messaging
E-mail messages read over the phone, outgoing e-mail recorded (and in the future transcribed) over the phone, voice-oriented address information synchronized with personal organizers and e-mail systems. Pager messages can be originated from the phone, or routed to the phone. Checking the status of bids at electronic auction sites, bill payment authorization, charitable goods pickup scheduling, wake up reminder services.

Special needs
While all VoiceXML services will benefit visually impaired people, it may be that other services will be specially crafted for this community. Voice-driven interfaces will also be of great benefit to people who are unable to leave their homes due to disability, providing them with a portal to the community using nothing more than a telephone handset.

Community networks
A community network portal is a universal portal, including telephone access, to the network structures that support and enable functions within a community. It is a way to meet other members of the community on-line, to exchange information and announcements, or to organize co-operative activity.
