The terms speech recognition and voice recognition are popping up more and more frequently in news articles and social media. The development of these technologies has given us tools and digital assistants like Amazon's Alexa, Microsoft's Cortana, and Apple's Siri, and has made content more accessible to everyone. While the terms speech recognition and voice recognition are often used interchangeably, there are key differences between the two that are important to understand.
With training, voice recognition software is able to identify one specific voice. The training process usually entails the user reading through a variety of phrases, for example, "I went to the store to pick up apples, oranges and bananas." The software uses these phrases to learn the speaker's voice, inflections, and tone. This process is what most digital assistants and voice-to-text apps use.
This works because:
- There is only one speaker
- The range of tasks it is asked to perform is limited
- The digital assistant can ask for a repeat of the phrase
- It can infer meaning, even if it misses a few words
We’ve all seen or experienced firsthand the usually funny, albeit frustrating, voice-to-text features and functions on our phones. While not all voice recognition systems are the same, there is still a way to go before you don’t have to worry about calling your ex, “Tom,” when you meant to call “Mom.”
In captioning, voice recognition software can be used by a shadow speaker. This is someone who trains a voice recognition program on their own voice and uses it to transcribe audio into captions. The shadow speaker listens to a live audio feed, repeats that audio into a specialized microphone, and then sends the transcribed text to an encoder for broadcast. There are some issues with this method of captioning:
- If the shadow speaker is sick, or their voice changes, the accuracy falls dramatically.
- If the shadow speaker is not able to keep up with the live audio feed or can’t understand what is being said, the audio isn’t captioned.
Where voice recognition learns a specific voice, speech recognition software is able to identify speech itself. Using speech pattern algorithms and language models, speech recognition can transcribe any speaker without the software being trained on their specific voice. For the highest accuracy, high-quality audio is necessary. To achieve this:
- Only one speaker should speak at a time.
- There should be minimal background noise.
- A high-quality microphone is needed.
One application of automatic speech recognition, or ASR, is automatic captioning, as in the ACE series from Link Electronics. The system uses a state-of-the-art linguistics algorithm to turn speech into caption data. The system typically sees accuracy rates of 95% unless there is poor audio quality. Poor audio quality could mean:
- There is loud background noise.
- The speaker is mumbling or covering their mouth.
- There is more than one speaker at a time.
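Accuracy figures like the 95% above are often derived from word error rate (WER): the number of substituted, dropped, and inserted words in the caption, divided by the number of words actually spoken. Whether the ACE series measures accuracy this way is an assumption here; the sketch below only illustrates the common calculation, and the function names are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two texts, divided by the
    number of words in the reference (what was actually said)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # spoken word missing from caption
            insertion = curr[j - 1] + 1  # extra word in caption
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Accuracy as 1 - WER, floored at 0 (WER can exceed 1 when the
    caption contains many extra words)."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))


if __name__ == "__main__":
    spoken = "please call mom"
    captioned = "please call tom"
    print(f"WER: {word_error_rate(spoken, captioned):.2f}")
    print(f"Accuracy: {word_accuracy(spoken, captioned):.2f}")
```

Under this measure, 95% accuracy corresponds to roughly one wrong, missing, or extra word for every twenty words spoken.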
While ASR might not be perfect, it allows the use of speech recognition technology without the user spending time training the software. There are also ways of improving the accuracy of ASR. For example, with customization the ACE series can automatically identify individual speakers, such as anchors and reporters, with a speaker ID. Link Electronics can also create a custom language model to more accurately caption specific words, phrases, and names of places or people that are unique to your locale.
The Future is all around us
Voice recognition and speech recognition are making their mark in our technology-driven world. According to Gartner, 75% of households in the US will have smart speakers by 2020. That’s an estimated 68% increase since 2017. The market for this industry is estimated to reach nearly $32 billion by 2025. And it isn’t just in smart speakers or phones that this technology is used. Audio-to-text software is growing in the healthcare industry, educational facilities, the financial industry, the government, and television broadcast, among many others. The potential for this technology is exciting, and it is bringing accessibility to those who don’t have what so many of us take for granted: the ability to hear and/or see. That can range from being able to follow what is going on in your favorite TV show to being able to access and understand the emergency weather reports for the storm heading your way. It may not seem like much to some, but speech and voice recognition have been, and will continue to be, an integral part of our future.