Voice Dictation: A Progress Report
June 8, 2000
Bob Seitz
Key Ideas
(1) Over the past 3 years, voice dictation systems
have improved from 95%-96% accuracy to greater than 98% accuracy,
while becoming more user-friendly. At that rate, they may begin
replacing manual typing within the next 5 years.
(2) For some reason, voice dictation vendors haven't
introduced the high-speed, special-purpose digital signal
processors that the commercial market would seem to support.
(3) The current thrust seems to be toward
command-and-control speech recognition rather than continuous
dictation. This may be a reflection of the burgeoning market for
automatic voice-response systems.
Background
Truly effective voice dictation, when it
finally arrives, might be expected to render professional typists
largely obsolete. It should significantly reduce the time
required to enter information into a computer, and it may play a
vital role in the voice-enabled world that is now expected to
develop over the next five to ten years.
The first generation of speech recognition systems
that appeared in 1997 weren't very practical. With recognition
accuracies typically in the 95% to 96% range, they simply made
too many errors to be very useful. You could type faster than you
could dictate and correct. I bought one of them, as did two of my
friends. All three of us had the same unsatisfactory experience.
However, these systems have steadily improved until they now
claim recognition accuracies in excess of 98%. In addition,
Dragon Systems has now introduced a $79.95 hardware interface
called Dragon Naturally Clear USB System H100 that bundles a special sound card with a
high-fidelity, noise-canceling microphone to ensure a
consistently high level of input quality. My next computer will
have USB ports and I'll try one of these systems. (Curiously
enough, none of these vendors offer digital signal processors
that might really kick-start speech recognition the way video
cards enhance video game performance.) However, this
hardware cost is in addition to the price of the software
package, which for Naturally Speaking Preferred 4.0 is $200,
minus a $50 rebate.
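The claim that one could type faster than one could dictate and correct can be checked with a little arithmetic. The following is a minimal sketch, not a measurement: the speaking rate (120 words per minute) and the time needed to fix each misrecognized word (15 seconds) are illustrative assumptions rather than figures from any vendor. Only the accuracy levels come from the discussion above.

```python
# Rough model of dictation throughput once correction time is included.
# ASSUMPTIONS (not vendor data): speaking rate and seconds per correction.

def effective_wpm(accuracy, dictation_wpm=120.0, seconds_per_correction=15.0):
    """Return net words per minute after fixing misrecognized words."""
    seconds_speaking = 60.0 / dictation_wpm  # time to say one word
    # Average correction time charged to each word spoken.
    seconds_correcting = (1.0 - accuracy) * seconds_per_correction
    return 60.0 / (seconds_speaking + seconds_correcting)


if __name__ == "__main__":
    for accuracy in (0.95, 0.96, 0.98):
        net = effective_wpm(accuracy)
        print(f"{accuracy:.0%} accuracy: {net:.1f} net words per minute")
```

Under these assumed figures, 95% accuracy nets about 48 words per minute, which a competent typist can match, while 98% nets about 75. Different assumptions will move the numbers, but each added point of accuracy removes a large share of the correction time.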
Four Main Vendors:
There are four main vendors of low-cost voice
dictation systems: IBM, Lernout and Hauspie, Dragon Systems (just
purchased by Lernout and Hauspie), and Philips Speech Processing.
These four vendors' products have recently (March 30) been
reviewed in an excellent article called "Speak Softly and
Carry a Big Chip" in the New York Times.
Lernout and Hauspie recently bought out Kurzweil. L&H may
succeed through acquisitions where they couldn't make it on their
own. The two remaining players are IBM and Philips--huge
corporations.
Dragon's "Naturally Speaking 4.0"
Generally speaking, Dragon Systems is
felt to have maintained a slight edge over IBM in this
horse-race. Dragon Systems' bundled microphone has been a little
better than IBM's.
IBM's "ViaVoice 4.0"
IBM's entry is called
"ViaVoice". Prices and specifications may be found at
this hyperlinked location. (ViaVoice Pro costs $75.95, while
ViaVoice Standard runs $37.95. A demo version of ViaVoice Pro may
be downloaded free of charge.) One advantage to ViaVoice is that
it may be used on computers that lack USB ports. Both of these
vendors state that their programs are optimized for cutting-edge
Athlon and Pentium III computers, although they will operate on
older machines. This is important because it means that these
companies are availing themselves of the latest computer
technology--a prerequisite to high-performance speech (voice)
dictation.
Lernout & Hauspie's "Voice Xpress Professional 4.0"
Lernout and Hauspie generally ranks
behind IBM and Dragon Systems. Lernout and Hauspie is 7%-owned by
Microsoft. L&H's "Voice Xpress Professional, Version
4" is the L&H voice dictation package to own.
Philips' "FreeSpeech 2000"
Philips is a dark horse in this race. Only its
$99.95 FreeSpeech 2000 program is deemed to be a serious
contender in this voice dictation derby.
All Four Have Attractive Features
In their efforts to be
competitive, each of the four has unique strengths. Dragon
Systems and IBM lead in the all-important requirement for maximum
recognition accuracy. Lernout & Hauspie is ahead in
natural-language understanding of commands. FreeSpeech has an
extended repertoire of editing commands. At the same time, each
of these vendors soon adopts the improvements pioneered by its
competitors.
It should be mentioned that these voice
dictation systems don't truly use "natural" speech. One
must speak clearly, and these systems learn over time as mistakes
are corrected.
Considering the rate of progress in this
field, practical voice dictation for most of us will probably
appear within the next five years. Computer speeds should increase
10-fold over this five-year period, while RAM and disk complements
should also increase 10-fold. Such hardware enhancements, coupled with software
evolution, ought to lead to major enhancements in what these
systems can do. If computers continue to improve apace, then by
2010, they will be running at 100 times today's speeds, with 100
times today's storage capacities, and voice dictation should be a
proper substitute for typing.
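The projection above is simple compounding. As a sketch, assuming (as the text does) a steady 10-fold improvement every five years starting from 2000, the multipliers and the implied doubling time can be worked out as follows. The function names are illustrative.

```python
import math

# ASSUMPTION taken from the text: hardware improves 10-fold every five years.
TENFOLD_PERIOD_YEARS = 5.0


def growth_multiplier(years):
    """Improvement factor after `years` of steady 10-fold-per-period growth."""
    return 10.0 ** (years / TENFOLD_PERIOD_YEARS)


def doubling_time_years():
    """Years needed to double at the same steady rate."""
    return TENFOLD_PERIOD_YEARS * math.log(2) / math.log(10)


if __name__ == "__main__":
    for year in (2005, 2010):
        factor = growth_multiplier(year - 2000)
        print(f"{year}: {factor:.0f} times today's speed")
    print(f"Implied doubling time: {doubling_time_years():.2f} years")
```

A 10-fold gain every five years corresponds to a doubling roughly every 18 months, so the 100-fold figure for 2010 holds only if that pace is sustained for the full decade.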
Voice Enabled Appliances
A more-toothsome market opportunity for speech
recognition seems to be the area of voice-enabled commands.
This is easier to implement than speech dictation because the
vocabulary is limited and relatively free of ambiguities (e. g.,
homonyms). As anyone who's called a family of mutual funds
recently can attest, there's already a flourishing market for
this in voice response systems. Since the customers are
commercial, vendors can charge a lot more for such systems than
they can charge home users. One set of candidates for voice
enabling is home appliances such as microwaves. I have some
reservations about just how well that's going to sell. I can
picture someone saying, "...and tomorrow, we'd better start cooking the turkey about 10 o'clock..." and all of a
sudden, the microwave turns itself on. Voice warnings were tried
in cars a few years ago. About the first time someone alone in
her car jumped a foot when she heard a man's voice saying,
"Your oil is a little low", the voice-alert system got
disconnected. One of the problems with voice-enabled equipment is
that there are times when someone's asleep and you don't want to
disturb them. But right now, vendors are seeing $ signs when they
envision voice-enabled devices. We'll see. Personally, I'm all
for it.
MIT's "Jupiter" System
MIT's Jupiter system is a conversational,
domain-specific, telephone-based weather information system with
a 2,000-word vocabulary. The following questions are typical of
the wording of natural-English questions that might be asked of
Jupiter.
- "What cities do you know about in California?"
- "How about in France?"
- "What will the temperature be in Boston tomorrow?"
- "What about the humidity?"
- "Are there any flood warnings in the United States?"
- "Where is it sunny in the Caribbean?"
- "What's the wind speed in Chicago?"
- "How about London?"
- "Can you give me the forecast for Seattle?"
- "Will it rain tomorrow in Denver?"
Jupiter can understand about 80% of the queries
received from novice users and more than 95% of those asked by
experienced users. Jupiter runs on an unoptimized 500 MHz Pentium
III. Jupiter contains a speech recognition module, SUMMIT, a natural
language interpretation module, TINA, and a sentence
synthesis module, GENESIS.
You may test this system by calling Jupiter at 1-888-573-TALK
(1-888-573-8255).
High Priced Speech Recognition Systems
I wasn't able to find much information
concerning high-priced speech recognition systems, but I have the
impression that they're targeting applications other than speech
dictation. As might be expected, AT&T and Lucent Technologies
are key players in this domain.
A Uniquely Human Capability
It's significant to me that speech recognition
is a uniquely human capability. I find thought-provoking the idea
that we can already do as well as we can do with computers in the
fraction of a gigops range. That's still down by a factor of
several hundred thousand below the 100,000 gigops speed that Hans
Moravec has estimated will be needed to emulate the human brain.
I should imagine that in no more than 5 years, when we reach
speeds of several gigops in our laptops and desktops, we'll be
capable of pretty sophisticated voice dictation. And at the 100
gigops level projected for 2010, we should be able to approach
the human ear at least as far as word recognition is concerned.
One hundred gigops would still be only 0.1% of 100,000 gigops.
Interesting!
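The ratios in this section are easy to verify. The sketch below uses the 100,000-gigops estimate attributed to Moravec above. The 0.3-gigops figure for a present-day machine is an assumed stand-in for "a fraction of a gigops", not a measured value.

```python
# Estimate quoted in the text for emulating the human brain.
BRAIN_GIGOPS = 100_000.0


def shortfall_factor(machine_gigops):
    """How many times slower a machine is than the brain estimate."""
    return BRAIN_GIGOPS / machine_gigops


def percent_of_brain(machine_gigops):
    """Machine speed as a percentage of the brain estimate."""
    return 100.0 * machine_gigops / BRAIN_GIGOPS


if __name__ == "__main__":
    today = 0.3  # ASSUMPTION: illustrative value for "a fraction of a gigops"
    print(f"Today: about {shortfall_factor(today):,.0f} times short")
    print(f"At 100 gigops: {percent_of_brain(100.0):.1f}% of the estimate")
```

With the assumed 0.3 gigops, the shortfall comes to a few hundred thousand, in line with the figure quoted above, and 100 gigops is indeed 0.1% of the estimate.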