Speech recognition (also known as Automatic Speech Recognition, or ASR) is the technology that allows computers to interpret human speech.
Most experts predict that human-level ASR will be developed by 2010-2015. Two likely consequences of this will be:
- A lower barrier to computer interaction will lead to much more frequent generation of information, search and consumption of information, and communication.
- Increased generation of information will lead to the creation of huge amounts of unstructured information: thoughts, ideas, memories, etc. Verbal information will be part of a larger corpus of digital memory.
Importance of speech recognition
Speech recognition converts speech into text, making information both easier to create and easier to use. Speech is easier to generate: it is intuitive and fast. But listening to speech is slow, speech is hard to index, and spoken content is easy to forget. Text is easier to store, process and consume, both for computers and for humans, but writing text is slow and requires deliberate effort.
Applications
In the past, people mostly imagined speech recognition directly producing the end result, e.g. a dictated document or a computer carrying out a command. This is a limited perspective: the availability of speech recognition is likely to make possible much more varied applications. For example, speech recognition will likely be used...
- ...to send instant messages.
- ...to annotate and to comment.
- ...to keep real-time transcripts during conversations.
- ...to instruct and answer computers in hands-free settings (e.g. while driving; though see DrivingCars).
- ...eventually, for general computer interaction: the Linguistic User Interface (LUI).
Communications
You will speak to tell someone something, and they will read to understand it. Your microphone will be connected to your instant messenger. When you say, "Jim, how are you doing?", the computer will recognize that you mean to talk with Jim, and will send the text "how are you doing?" to him.
Jim may be gone at the moment. But when he returns in 10 minutes, he may say, "Dave, I'm doing fine. Work was wearisome, but otherwise, I'm fine."
You are both speaking, but you are both reading each other's text.
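How such routing could work is easy to sketch. Below is a minimal illustration in Python; the transcribe() stub, the send_instant_message() helper and the contact list are inventions of this sketch, standing in for a real ASR engine and a real IM client.

# Sketch: route a spoken utterance to an instant-message contact.
# transcribe() stands in for a real ASR engine; here it returns canned text.
CONTACTS = {"jim": "jim@example.com", "dave": "dave@example.com"}

def transcribe(audio):
    return "Jim, how are you doing?"  # placeholder for real recognition output

def send_instant_message(address, message):
    print("To %s: %s" % (address, message))  # placeholder for an IM client API

def route_utterance(audio):
    text = transcribe(audio)
    name, _, message = text.partition(",")
    address = CONTACTS.get(name.strip().lower())
    if address and message:
        send_instant_message(address, message.strip())
    else:
        print("No addressee recognized; keeping the text as a note:", text)

route_utterance(audio=None)

A real system would of course need far more robust addressee detection than splitting on the first comma.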
Discovering Conversations
It is already possible to search globally across many electronic conversations, such as forum discussions or mailing-list archives. Speech recognition will make it possible to search virtually all (open) conversations.
Imagine that you are studying biology, for example mitochondria. You study with a co-learner, by voice, over the Internet. Because the subject is educational, you let the conversation be public. [1]
A computer program transcribes your conversation in real time, and another program indexes it in real time.
A few states away, someone else is also studying biology. They perform a search and discover the conversation you are having. They may leave a note at an information node representing your conversation [2] [3], or, if you are talking at that particular moment, opt to listen in. A small icon lets you know that someone is listening in on the conversation. You may invite them in, or they may knock, requesting to come in.
This is made possible by speech recognition, but it is not a scenario most people think of when they think of speech recognition.
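The indexing half of the scenario is already simple to prototype. Here is a minimal sketch in Python, assuming transcript segments arrive as a conversation id, a timestamp and a piece of text; those details are assumptions of the sketch, not something specified above.

# Sketch: build a searchable index of live, public conversations
# from transcript segments produced by a real-time ASR program.
from collections import defaultdict

class ConversationIndex:
    def __init__(self):
        # word -> list of (conversation_id, timestamp) postings
        self.postings = defaultdict(list)

    def add_segment(self, conversation_id, timestamp, text):
        for word in text.lower().split():
            self.postings[word.strip(".,?!")].append((conversation_id, timestamp))

    def search(self, word):
        return self.postings.get(word.strip().lower(), [])

index = ConversationIndex()
index.add_segment("bio-study-42", "14:03:10", "Mitochondria produce most of the cell's ATP.")
print(index.search("mitochondria"))  # -> [('bio-study-42', '14:03:10')]

A production index would add ranking, privacy controls and live updates, but the core idea is this small.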
Annotations
Similarly, when you want to attach a comment on Slashdot, you will just hold down the spacebar and speak your mind. Comment attached. The same goes for attaching comments to documents, to songs that are playing, or to anything else you care to comment on.
Ubiquitous transcripts
Recording conversations will be the norm. There will be few disputes at work about who did or did not say something; it will all be recorded automatically, like having a court reporter in every room. There will be a searchable, time-indexed, tagged and annotated transcript of everything that is spoken. Everything.
When people have a hard time understanding a concept because it is being poorly presented, we will have all the evidence we need: "See, when you explain things this way, it usually takes three times longer than when you explain it this other way."
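What such a transcript might look like as data is easy to imagine. A minimal sketch, assuming each utterance is stored with a timestamp, a speaker, free-form tags and annotations; these fields are illustrative choices, not a prescribed format.

# Sketch: a time-indexed, tagged, annotated transcript entry,
# plus a trivial "who said that?" search over it.
from dataclasses import dataclass, field

@dataclass
class Utterance:
    timestamp: str              # e.g. "10:42:07"
    speaker: str
    text: str
    tags: set = field(default_factory=set)
    annotations: list = field(default_factory=list)

transcript = [
    Utterance("10:42:07", "Alice", "Let's move the deadline to Friday.", {"work", "scheduling"}),
    Utterance("10:42:31", "Bob", "Fine, but document that decision.", {"work"}),
]

def who_said(transcript, phrase):
    return [(u.timestamp, u.speaker) for u in transcript if phrase.lower() in u.text.lower()]

print(who_said(transcript, "deadline"))  # -> [('10:42:07', 'Alice')]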
All of this is unlocked when you have Speech Recognition. Speech Recognition is no small thing. Do not be one of those people who envision themselves writing Word documents with speech recognition.
Processing recordings
Since much more content will be generated, and since, unlike intentionally written text, it will be poorly formed and poorly structured, we will not be able to rely on the human author to make it easy to read.
Widespread creative use of speech recognition is a small step toward smart documents. With ASR, a linear audio recording (a podcast, say) of someone's stream of consciousness can automatically acquire an index, a table of contents, and so on. Software (AI) will have to process it and make it easy to navigate. The listener (or reader) could then become an interlocutor, asking questions, paraphrasing, and agreeing or disagreeing in order to navigate the "document".
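One way such navigation aids could be derived: split the recording wherever there is a long pause and title each chunk with its most frequent content words. The sketch below assumes ASR output as (start_seconds, text) segments, a pause threshold and a small stop-word list, none of which are specified here.

# Sketch: derive a rough table of contents for a spoken recording
# from ASR output given as (start_seconds, text) segments.
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "it", "that", "i", "so"}

def chunk_title(texts):
    words = [w.strip(".,?!:").lower() for t in texts for w in t.split()]
    common = Counter(w for w in words if w and w not in STOPWORDS).most_common(3)
    return " / ".join(w for w, _ in common)

def table_of_contents(segments, pause=8.0):
    toc, chunk, chunk_start, last_end = [], [], None, None
    for start, text in segments:
        if chunk and start - last_end > pause:   # a long silence closes the chunk
            toc.append((chunk_start, chunk_title(chunk)))
            chunk = []
        if not chunk:
            chunk_start = start
        chunk.append(text)
        last_end = start + len(text.split()) / 2.5   # assume roughly 2.5 words per second
    if chunk:
        toc.append((chunk_start, chunk_title(chunk)))
    return toc

segments = [(0.0, "Today I want to talk about mitochondria and energy."),
            (4.0, "Mitochondria produce ATP for the cell."),
            (30.0, "Now, a different idea: note-taking by voice.")]
print(table_of_contents(segments))

Real software would use proper topic segmentation and summarization, but even a crude pass like this turns a formless recording into something navigable.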
Interesting Developments
Augmented reality will make textual and rich-media conversations possible in the real world too, not just in virtual space. Computers will process the speech you hear and generate visual information for you, such as visual maps of arguments (ArgumentGraphs).
Subvocal recording (see the NASA project) is a technology that can record thoughts "spoken" in your mind. It is easier and closer than one might think (2015-2025). An advantage of speech recognition here is that it is a more direct link between the mind and a computer than a keyboard or mouse. The only better technology would be a direct brain-computer interface.
Current state
- Limited-vocabulary speech recognition is very good, and is presently expanding into corporate phone trees (implementing voice applications on top of existing phone systems).
- Large-vocabulary (general) speech recognition still isn't perfect. You still have to speak a little more slowly, and corrections are necessary. But the computer is fairly good at recognizing context and letting you correct it, and it can even learn your language-use patterns from your e-mail and document archive. (Flash demonstration of ASR as of Nov 2004; article.)
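As a present-day illustration of large-vocabulary recognition, here is a minimal sketch using the Python SpeechRecognition package with a hosted recognizer; the package, the service and the file name meeting.wav are assumptions of this sketch rather than anything described above.

# Sketch: transcribe a recorded WAV file with an off-the-shelf ASR toolkit.
# Requires: pip install SpeechRecognition
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("meeting.wav") as source:
    audio = recognizer.record(source)          # load the whole recording

try:
    text = recognizer.recognize_google(audio)  # send audio to a hosted recognizer
    print(text)
except sr.UnknownValueError:
    print("Speech was unintelligible; a human correction pass is still needed.")

Even this short example shows the division of labor: the recognizer produces raw text, while correction and structuring remain separate steps.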
Forecasts
IBM intends to have better-than-human ASR by 2010. Bill Gates predicted that by 2011 the quality of ASR would catch up to that of humans. Justin Rattner of Intel said in 2005 [4] that by 2015 computers would have "strong capabilities" in speech-to-text.