Voice Dictation: A Progress Report
June 8, 2000
Bob Seitz

Key Ideas
(1)  Over the past 3 years, voice dictation systems have improved from 95%-96% accuracy to greater than 98% accuracy, while becoming more user-friendly. At that rate, they may begin replacing manual typing within the next 5 years.
(2)  For some reason, voice dictation vendors haven't introduced the high-speed, special-purpose digital signal processors that the commercial market would seem to support.
(3)  The current thrust seems to be toward command-and-control speech recognition rather than continuous dictation. This may be a reflection of the burgeoning market for automatic voice-response systems.

   Truly effective voice dictation, when it finally arrives, might be expected to render professional typists largely obsolete. It should significantly reduce the time required to enter information into a computer, and it may play a vital role in the voice-enabled world that is now expected to develop over the next five to ten years.
   The first generation of speech recognition systems that appeared in 1997 weren't very practical. With recognition accuracies typically in the 95% to 96% range, they simply made too many errors to be very useful. You could type faster than you could dictate and correct. I bought one of them, as did two of my friends. All three of us had the same unsatisfactory experience. However, these systems have steadily improved until they now claim recognition accuracies in excess of 98%. In addition, Dragon Systems has now introduced a $79.95 hardware interface called Dragon Naturally Clear USB System H100 that bundles a special sound card with a high-fidelity, noise-canceling microphone to ensure a consistently high level of input quality. This hardware cost is in addition to the price of the software package, which, for Naturally Speaking Preferred 4.0, is $200, minus a $50 rebate. My next computer will have USB ports and I'll try one of these systems. (Curiously enough, none of these vendors offers digital signal processors that might really kick-start speech recognition the way video cards enhance video game performance.)
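
   To see why a few percentage points of accuracy matter so much, here is a rough sketch in Python. The recognition accuracies come from the figures above; the page length (300 words), dictation rate (120 words per minute), and time per correction (20 seconds) are illustrative assumptions, not measurements.

```python
def errors_per_page(accuracy, words_per_page=300):
    """Expected number of misrecognized words on one page.

    words_per_page is an assumed figure, not taken from any vendor.
    """
    return (1.0 - accuracy) * words_per_page


def effective_wpm(dictation_wpm, accuracy, seconds_per_fix):
    """Net words per minute once correction time is counted.

    Each word costs the time to speak it plus, on average, the
    error rate times the time needed to fix one mistake.
    """
    minutes_speaking = 1.0 / dictation_wpm
    minutes_fixing = (1.0 - accuracy) * seconds_per_fix / 60.0
    return 1.0 / (minutes_speaking + minutes_fixing)


if __name__ == "__main__":
    # 120 wpm and 20 s per correction are hypothetical inputs.
    for acc in (0.95, 0.96, 0.98):
        print(
            f"{acc:.0%} accurate: "
            f"{errors_per_page(acc):.0f} errors per page, "
            f"{effective_wpm(120, acc, 20):.0f} net words per minute"
        )
```

   Under these assumed figures, 95% accuracy works out to about 40 net words per minute and 98% to about 67, which suggests why the early systems could lose out to plain typing.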

Four Main Vendors:
   There are four main vendors of low-cost voice dictation systems: IBM, Lernout and Hauspie, Dragon Systems (just purchased by Lernout and Hauspie), and Philips Speech Processing. These four vendors' products have recently (March 30) been reviewed in an excellent article called "Speak Softly and Carry a Big Chip" in the New York Times. Lernout and Hauspie recently bought out Kurzweil. L&H may succeed through acquisitions where they couldn't make it on their own. The two remaining players are IBM and Philips--huge corporations.
Dragon's "Naturally Speaking 4.0"
    Generally speaking, Dragon Systems is felt to have maintained a slight edge over IBM in this horse race. Dragon Systems' bundled microphone has been a little better than IBM's.
IBM's "ViaVoice 4.0"
    IBM's entry is called "ViaVoice". Prices and specifications may be found at this hyperlinked location. (ViaVoice Pro costs $75.95, while ViaVoice Standard runs $37.95. A demo version of ViaVoice Pro may be downloaded free of charge.) One advantage to ViaVoice is that it may be used on computers that lack USB ports. Both of these vendors state that their programs are optimized for cutting-edge Athlon and Pentium III computers, although they will operate on older machines. This is important because it means that these companies are availing themselves of the latest computer technology--a prerequisite to high-performance speech (voice) dictation.
Lernout & Hauspie's "Voice Xpress Professional 4.0"
    Lernout and Hauspie generally ranks behind IBM and Dragon Systems. Lernout and Hauspie is 7%-owned by Microsoft. L&H's "Voice Xpress Professional, Version 4" is the L&H voice dictation package to own.
Philips' "FreeSpeech 2000"
   Philips is a dark horse in this race. Only its $99.95 FreeSpeech 2000 program is deemed to be a serious contender in this voice dictation derby.

All Four Have Attractive Features
    In their efforts to be competitive, each of the four has unique strengths. Dragon Systems and IBM lead in the all-important requirement for maximum recognition accuracy. Lernout & Hauspie is ahead in natural-language understanding of commands. FreeSpeech has an extended repertoire of editing commands. At the same time, each of these vendors soon adopts the improvements pioneered by its competitors.
    It should be mentioned that these voice dictation systems don't truly use "natural" speech. One must speak clearly, and these systems learn over time as mistakes are corrected.

    Considering the rate of progress in this field, practical voice dictation for most of us will probably appear within the next five years. Computer speeds should increase 10-fold over this five-year period, while RAM and disk complements should also increase 10-fold. Such hardware enhancements, coupled with software evolution, ought to lead to major enhancements in what these systems can do. If computers continue to improve apace, then by 2010, they will be running at 100 times today's speeds, with 100 times today's storage capacities, and voice dictation should be a proper substitute for typing.
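
    The projection above is simple compounding, and it can be checked with a short Python sketch. The only input taken from the text is the assumption of a 10-fold gain every five years; the function names are made up for illustration.

```python
import math


def growth_factor(years, tenfold_period_years=5.0):
    """Cumulative improvement after `years`, assuming a steady
    10-fold gain every `tenfold_period_years`."""
    return 10.0 ** (years / tenfold_period_years)


def doubling_time_years(tenfold_period_years=5.0):
    """How long one doubling takes at that same steady rate."""
    return tenfold_period_years * math.log(2) / math.log(10)


if __name__ == "__main__":
    for years in (1, 5, 10):
        print(f"after {years:2d} years: {growth_factor(years):6.1f}x")
    print(f"doubling time: {doubling_time_years():.2f} years")
```

    A 10-fold gain every five years amounts to a doubling roughly every year and a half, and ten years at that pace gives the 100-fold figure quoted for 2010.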

Voice-Enabled Appliances
   A more toothsome market opportunity for speech recognition seems to be the area of voice-enabled commands. This is easier to implement than speech dictation because the vocabulary is limited and relatively free of ambiguities (e.g., homonyms). As anyone who's called a family of mutual funds recently can attest, there's already a flourishing market for this in voice response systems. Since the customers are commercial, vendors can charge a lot more for such systems than they can charge home users. One set of candidates for voice enabling is home appliances such as microwaves. I have some reservations about just how well that's going to sell. I can picture someone saying, "...and tomorrow, we'd better start cooking the turkey about 10 o'clock..." and all of a sudden, the microwave turns itself on. Voice warnings were tried in cars a few years ago. About the first time someone alone in her car jumped a foot when she heard a man's voice saying, "Your oil is a little low", the voice-alert system got disconnected. One of the problems with voice-enabled equipment is that there are times when someone's asleep and you don't want to disturb them. But right now, vendors are seeing dollar signs when they envision voice-enabled devices. We'll see. Personally, I'm all for it.

MIT's "Jupiter" System
    MIT's Jupiter system is a conversational, domain-specific, telephone-based weather information system with a 2,000-word vocabulary. The following questions are typical of the wording of natural-English questions that might be asked of Jupiter.

"- What cities do you know about in California?"
"- How about in France?"
"- What will the temperature be in Boston tomorrow?"
"- What about the humidity?"
"- Are there any flood warnings in the United States?"
"- Where is it sunny in the Caribbean?"
"- What's the wind speed in Chicago?"
"- How about London?"
"- Can you give me the forecast for Seattle?"
"- Will it rain tomorrow in Denver?"

   Jupiter can understand about 80% of the queries received from novice users and more than 95% of those asked by experienced users. Jupiter runs on an unoptimized 500 MHz Pentium III. It contains a speech recognition module (SUMMIT), a natural language interpretation module (TINA), and a sentence synthesis module (GENESIS). You may test this system by calling Jupiter at 1-888-573-TALK (1-888-573-8255).

High-Priced Speech Recognition Systems
    I wasn't able to find much information concerning high-priced speech recognition systems, but I have the impression that they're targeting applications other than speech dictation. As might be expected, AT&T and Lucent Technologies are key players in this domain.

A Uniquely Human Capability
   It's significant to me that speech recognition is a uniquely human capability. I find it thought-provoking that we can already do as well as we do with computers running at only a fraction of a gigops. That's still several hundred thousand times below the 100,000 gigops speed that Hans Moravec has estimated will be needed to emulate the human brain. I should imagine that in no more than 5 years, when we reach speeds of several gigops in our laptops and desktops, we'll be capable of pretty sophisticated voice dictation. And at the 100 gigops level projected for 2010, we should be able to approach the human ear at least as far as word recognition is concerned. One hundred gigops would still be only 0.1% of 100,000 gigops. Interesting!
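
   The ratios in this comparison are easy to verify with a few lines of Python. The 100,000 gigops figure is the Moravec estimate quoted above; the 0.3 gigops value standing in for "a fraction of a gigops" is an assumption chosen purely for illustration.

```python
HUMAN_BRAIN_GIGOPS = 100_000.0  # Moravec's estimate, as quoted in the text


def fraction_of_brain(gigops):
    """Share of the estimated brain-equivalent speed, from 0.0 to 1.0."""
    return gigops / HUMAN_BRAIN_GIGOPS


def shortfall_factor(gigops):
    """How many times slower than the brain estimate a machine is."""
    return HUMAN_BRAIN_GIGOPS / gigops


if __name__ == "__main__":
    # 0.3 gigops is a hypothetical stand-in for today's desktop.
    print(f"today (assumed 0.3 gigops): {shortfall_factor(0.3):,.0f}x short")
    print(f"2010 (100 gigops): {fraction_of_brain(100):.1%} of the estimate")
```

   At the assumed 0.3 gigops, the shortfall comes to a few hundred thousand, and 100 gigops comes to 0.1% of the estimate, matching the figures in the paragraph above.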