In today’s constantly on-the-go society, consumers expect efficiency. First they wanted to quickly turn on a song with a single, verbal command while they cooked dinner. But thanks to ongoing development of automatic speech recognition (ASR) technology, consumers can now do much, much more.
ASR is a subfield of Artificial Intelligence (AI) in which a computer recognizes spoken words and transforms them into text. The process is also commonly referred to as “speech-to-text.”
The process can be applied to live speech or audio/video recordings. In short, ASR is the technology that makes it possible to dictate texts into your iPhone or read transcripts of your voicemails.
And while its everyday applications are vast, ASR is also transforming how multiple industries do business. Media and entertainment creatives can produce content faster when hours of audio or video files are converted into searchable transcripts; educational institutions can provide safe, accessible, remote learning through real-time captioning in video conferencing software; and researchers can begin analyzing qualitative data in a matter of minutes thanks to asynchronous, machine-generated transcription. These are just a few examples of how speech-to-text applications are influencing society.
ASR technologies like our own Rev.ai offer cloud-based APIs specifically to help developers build applications powered by speech-to-text. If you want to incorporate speech-to-text capabilities into your product, an ASR API like Rev.ai’s could help you get to market faster than your competitors.
In ASR’s 60-year existence (it’s true!) speech scientists and engineers have made great advancements — so much so that sometimes it can be a little overwhelming, especially if you’re just beginning to learn about the field.
But don’t worry! Rev is here to help. After all, speech-to-text is kind of our whole thing. So in this article, we’ll take a look at four key features of ASR to help you better understand this exciting technology. Let’s dive in!
1. Accuracy
Accuracy is how precisely ASR software converts spoken words into text. When evaluating the accuracy of an ASR service, we recommend calculating Word Error Rate (WER) to test how well the software performs.
WER is calculated by adding the number of Substitutions (words replaced), Insertions (words added), and Deletions (words omitted), then dividing that sum by the total number of words actually spoken.
Think of it like golf scores — the lower your WER score, the more accurate your ASR service. But if a service has a high WER, that means the final machine-generated output will have more errors.
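To make the WER formula concrete, here is a minimal Python sketch. It assumes a simple whitespace tokenizer and case-insensitive matching, and it uses a standard word-level edit distance to count substitutions, insertions, and deletions. The function name and the sample sentences are illustrative only; this is not Rev's benchmarking code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + insertions + deletions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the minimum number of edits needed to turn the
    # reference words seen so far into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # a reference word is missing
            insertion = curr[j - 1] + 1  # an extra word was added
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One deletion ("the"), one substitution ("lights" -> "light"),
# and one insertion ("please") against five reference words.
wer = word_error_rate("turn on the kitchen lights",
                      "turn on kitchen light please")
print(f"WER: {wer:.0%}")
```

In this example, three errors against five spoken words give a WER of 60 percent. Real evaluations usually also normalize punctuation and number formatting before scoring, which this sketch leaves out.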
In fact, Rev’s benchmarking tests show that our ASR service has the lowest WER (14.22 percent) among competing services.
Rev’s low WER is due to a couple of factors. First, there’s the volume and quality of our training data. We train our model on the same kind of data that our customers send us: long-form audio files across multiple industries, with multiple speakers and complex, industry-specific terminology. Second, we use AI to build tools that help our Revver community be more effective in their jobs. Our freelancers work with, and train, Rev.ai, and they provide ground-truth transcripts for our speech recognition team. In turn, our Rev.ai engine produces a rough draft from which all Rev transcriptionists begin their work. This combination of humans and AI working together allows us to train our ASR engine that much more accurately.
2. Turnaround Time
How fast can an ASR service process a file? Speech recognition customers expect to receive their output as quickly as possible, so our speech scientists are constantly working to meet those expectations. If you’re integrating a speech recognition API into your product, you want to be sure that turnaround time will be an asset, not a hindrance. However, balance is important — sometimes, if you make speed a priority, you sacrifice quality. This issue comes up frequently with ASR applications like live captions. Many providers can offer fast live captioning, but those captions will come at the expense of punctuation and readability.
Rev.ai’s asynchronous API can transcribe an hour of audio or video in minutes. The streaming API, on the other hand, has a latency on the order of seconds for live captions or transcripts.
Rev.ai can also generate per-word timestamps for both our asynchronous and streaming APIs. This opens up many possibilities for developers and companies adopting ASR technology. For instance, Descript’s Overdub solution allows you to quickly correct recordings through a transcript: simply add or change a word in your transcript, and Overdub will add that word or correction to your audio track. This requires per-word timestamps so that the technology knows where in the audio to make each edit.
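As a rough illustration of why per-word timestamps matter for transcript-driven editing, the Python sketch below looks up where a given word occurs in the audio. The TimedWord structure, the find_word_spans helper, and the sample timings are all hypothetical; they do not reflect Rev.ai’s or Descript’s actual data formats.

```python
from typing import List, NamedTuple, Tuple


class TimedWord(NamedTuple):
    """Hypothetical per-word record: the text plus start/end times in seconds."""
    text: str
    start: float
    end: float


def find_word_spans(words: List[TimedWord], target: str) -> List[Tuple[float, float]]:
    """Return the (start, end) time range of every occurrence of `target`."""
    needle = target.lower()
    return [
        (word.start, word.end)
        for word in words
        # Ignore case and trailing punctuation when comparing.
        if word.text.lower().strip(".,!?") == needle
    ]


transcript = [
    TimedWord("We", 0.0, 0.2),
    TimedWord("ship", 0.2, 0.6),
    TimedWord("on", 0.6, 0.75),
    TimedWord("Tuesday.", 0.75, 1.3),
]

# An editor could use these ranges to decide which slice of audio to replace.
print(find_word_spans(transcript, "tuesday"))
```

Without word-level timing, the same lookup would have to fall back on sentence or paragraph boundaries, which is too coarse for single-word edits.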
If you have large volumes of transcriptions that you need quickly, Rev.ai has got you covered!
3. Multiple Language Options
We are living in a multilingual society and many of the languages spoken around us overlap. Often, you’ll come across people who are bilingual or trilingual and use several languages. Communication tools like Zoom and Slack make it easier than ever to communicate across the world, so it only makes sense that speech recognition technology should evolve with the times. That’s why you should consider an ASR service that supports multiple language options.
For instance, Rev.ai is now available in Spanish, French, German, and Portuguese. We’ve trained a single model on all accents and dialects (e.g., the subtle differences between French from Paris and French from Canada) using real-world data, so you can get results that are as accurate as possible.
Rev.ai can help you recognize different pronunciations and dialects, and distinguish between speakers. You can also use it to transcribe domain-specific conversations, thanks to our custom vocabulary feature. You can submit up to 6,000 custom words (the largest limit in the industry!) with each file, which means you can get nouns and technical terms right the first time.
4. Speaker Identification and Punctuation
Have you ever tried transcribing files that contain several speakers? And what if they talk over each other or interrupt each other frequently? Often those speakers may even sound quite similar. We know the struggle!
Speaker diarization can easily identify different people talking and attribute text to the right person. This means you know exactly who said what and when they said it – whether it’s a discussion between two people or a multi-speaker interview.
This feature is quite useful when you have to quote people afterward — Rev’s ASR can support up to eight speakers, and will save you the embarrassment of attributing text to the wrong person.
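To show what diarized output makes possible, here is a small Python sketch that groups speaker-labeled words into readable speaker turns. The input shape (a list of speaker/word pairs) and the group_into_turns helper are assumptions made for illustration, not the format Rev’s ASR returns.

```python
from itertools import groupby
from typing import List, Tuple


def group_into_turns(words: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge consecutive words from the same speaker into one turn."""
    turns = []
    # groupby only merges adjacent items, so a speaker who talks again
    # later gets a new turn rather than being folded into the first one.
    for speaker, run in groupby(words, key=lambda item: item[0]):
        turns.append((speaker, " ".join(word for _, word in run)))
    return turns


labeled_words = [
    ("Speaker 0", "Did"), ("Speaker 0", "you"), ("Speaker 0", "finish?"),
    ("Speaker 1", "Almost."),
    ("Speaker 0", "Great."),
]

for speaker, text in group_into_turns(labeled_words):
    print(f"{speaker}: {text}")
```

Once words carry speaker labels, attributing a quote is a matter of reading off the turn rather than listening back through the recording.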
Another key feature to look for in an ASR service is punctuation and sentence structure. It may seem like something every ASR service should provide, but that’s not necessarily the case. Some services return only raw text with no capitalization, punctuation, or paragraph breaks.
This means significant effort on your end to transform it into something more readable. Rev’s ASR provides highly accurate punctuation and text normalization (“four oh one kay” becomes 401K) so that your transcript is much easier to read.
Along with fast speech recognition, you get automated punctuation (commas, colons, question marks, periods, and more) and capitalization, which makes for a more legible transcript.
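For a sense of what text normalization involves, the toy Python sketch below turns spoken forms such as “four oh one kay” into “401K” with a small lookup table. The table and the normalize_spoken_numbers function are invented for this example. Production systems, including Rev’s, rely on far more context: this version would also wrongly rewrite “oh” used as an interjection or “one” used as a pronoun.

```python
# Toy lookup table (an assumption for this sketch, not a real ASR resource).
SPOKEN_TO_WRITTEN = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9", "kay": "K",
}


def normalize_spoken_numbers(text: str) -> str:
    """Collapse runs of spoken digits (and 'kay') into a single written token."""
    output = []
    run = []  # written forms collected from consecutive spoken tokens
    for token in text.split():
        written = SPOKEN_TO_WRITTEN.get(token.lower())
        if written is not None:
            run.append(written)
            continue
        if run:
            output.append("".join(run))
            run = []
        output.append(token)
    if run:
        output.append("".join(run))
    return " ".join(output)


print(normalize_spoken_numbers("my four oh one kay plan"))
```

Even this small example shows why raw, unnormalized output takes extra effort to clean up by hand.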
Find All These Features & Much More in Rev.ai
There you have it — four key features of a quality ASR engine.
ASR technology is progressively disrupting the way we operate in our classrooms, offices, and homes. With more features and applications, ASR services will continue to develop to best assist people who have now come to depend on them.
Rev is the only speech-to-text service that offers end-to-end options – from human to fully automated transcription and high-quality captions.
We train our models on noisy audio, which makes our service more resilient to background noise in the room or in your recordings. We provide APIs that let you choose exactly what you want in terms of accuracy, turnaround time, editing abilities, output formats, and price. Real-time speech recognition starts at $0.035 per minute with no commitments, and enterprise pricing starts at $0.02 per minute.
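To put those per-minute rates in perspective, here is a quick Python sketch that estimates cost for a given volume of audio. It assumes a flat per-minute rate with no minimums, tiers, or rounding of partial minutes, which may not match actual billing; the estimate_cost helper is purely illustrative.

```python
from decimal import Decimal

# Starting rates quoted above, in US dollars per minute of audio.
STANDARD_RATE = Decimal("0.035")
ENTERPRISE_RATE = Decimal("0.02")


def estimate_cost(minutes: int, rate_per_minute: Decimal) -> Decimal:
    """Estimate cost in dollars, rounded to the nearest cent."""
    # Decimal avoids the floating-point drift that shows up with currency.
    return (Decimal(minutes) * rate_per_minute).quantize(Decimal("0.01"))


ten_hours = 10 * 60  # minutes of audio
print(f"Standard:   ${estimate_cost(ten_hours, STANDARD_RATE)}")
print(f"Enterprise: ${estimate_cost(ten_hours, ENTERPRISE_RATE)}")
```

Under these assumptions, ten hours of audio comes to $21.00 at the standard starting rate and $12.00 at the enterprise starting rate.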