At Rev, we believe we have the most accurate speech recognition service on the market. Today, we’re setting the bar even higher with the launch of our v2 ASR model, which delivers a more than 30% increase in accuracy compared to our existing model.
We’ve tested our v2 ASR model extensively and found that this increase in accuracy holds across a wide range of topics, industries, and accents. This massive improvement is the result of two years of technical research and the application of the latest deep learning techniques to our millions of hours of transcribed speech.
Technical Approach
Prior to v2, our model followed a so-called “hybrid approach,” which combines multiple separately trained components built on classical statistical models such as Hidden Markov Models and Gaussian Mixture Models. Although extremely flexible, this hybrid system was not robust to different pronunciations, different acoustic environments, or multi-speaker audio; it was also less capable of learning from large quantities of data.
Our v2 model improves on this by using a single neural network in an end-to-end (E2E) model. Under this approach, the system is trained as a single unit, ingesting audio directly and learning as it goes. This approach largely solves key problems in accuracy, training, pronunciation/accents and diarization.
At Rev, we have taken advantage of this new approach and combined it with our large database of accurate transcripts to train the model and achieve the significant improvements mentioned above.
Benchmarks
So that’s the theory… now for some data. The two most important metrics we track are Word Error Rate (WER) and Speaker Switch WER, which we define as the WER measured in a five-word window around each speaker switch.
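For readers less familiar with the metric, WER is the number of word-level errors (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The snippet below is a minimal, generic illustration of that calculation in Node.js; it is not Rev’s internal scoring code, and real evaluation pipelines also apply text normalization before comparing transcripts.

```javascript
// Generic WER sketch: edits between reference and hypothesis, divided by reference length.
function wordErrorRate(reference, hypothesis) {
  const ref = reference.trim().split(/\s+/);
  const hyp = hypothesis.trim().split(/\s+/);

  // Word-level Levenshtein distance via dynamic programming.
  const d = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,        // deletion
        d[i][j - 1] + 1,        // insertion
        d[i - 1][j - 1] + cost  // substitution (or match)
      );
    }
  }
  // Total edits divided by the number of words in the reference.
  return d[ref.length][hyp.length] / ref.length;
}

// One substitution out of five reference words => WER = 0.2
console.log(wordErrorRate('thanks for calling rev ai', 'thanks for calling red ai'));
```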
| Metric | V1 Model | V2 Model | Relative Gain |
| --- | --- | --- | --- |
| Overall WER | 17.09% | 11.63% | 32% |
| Speaker Switch WER | 30.17% | 18.46% | 39% |
This shows that our new model yields a 32% reduction in errors overall, and performs much better around speaker switches. The latter is particularly important for real-life scenarios such as meetings, which often have multiple speakers talking out of turn or over each other.
The table below goes a little deeper into the data and shows the distribution of WER relative gains across the different domains we cover.
| Domain | V1 Model WER | V2 Model WER | Relative Gain |
| --- | --- | --- | --- |
| Overall | 17.09% | 11.63% | 32% |
| Business | 20.57% | 13.19% | 36% |
| Education | 20.80% | 14.22% | 32% |
| Entertainment | 16.54% | 10.86% | 34% |
| Health | 18.30% | 12.17% | 33% |
| Law | 23.58% | 15.31% | 35% |
| Politics | 18.65% | 13.41% | 28% |
| Religion | 16.62% | 10.94% | 34% |
| Science | 13.71% | 9.23% | 33% |
| Sports | 21.41% | 14.40% | 33% |
These figures are based on our internal test suites.
Get Started with v2 ASR
The v2 ASR model described above is our default production model for new users as of March 7, 2022, and you can start using it today. When no `transcriber` option is provided, or when the `transcriber` option is explicitly set to `machine_v2`, the audio file will be transcribed by the v2 ASR model.
Here’s an example of using the v2 model in an API call:
```bash
curl --location --request POST 'https://api.rev.ai/speechtotext/v1/jobs' \
  --header 'Authorization: Bearer YOUR-ACCESS-TOKEN-HERE' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "media_url": "https://www.rev.ai/FTC_Sample_1.mp3"
  }'
```
Existing users who have not yet been migrated to the v2 model as their default (see below for migration dates) should explicitly include the `transcriber: machine_v2` parameter. Here’s an example:
```bash
curl --location --request POST 'https://api.rev.ai/speechtotext/v1/jobs' \
  --header 'Authorization: Bearer YOUR-ACCESS-TOKEN-HERE' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "media_url": "https://www.rev.ai/FTC_Sample_1.mp3",
    "transcriber": "machine_v2"
  }'
```
This also applies to SDK operations, as shown below in this example for our Node SDK:
```javascript
// ...

// initialize the client with your access token
var client = new RevAiApiClient(accessToken);

// set job options
const jobOptions = {
  transcriber: 'machine_v2' // optional value for transcriber
};

// submit a file
var job = await client.submitJobUrl(mediaUrl, jobOptions);

// ...
```
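Once the job finishes, the transcript can be retrieved through the same client. The snippet below is a minimal sketch that assumes the standard `getJobDetails` and `getTranscriptText` methods of our Node SDK and the usual job status values; in production you would typically use webhooks rather than polling, and add error handling.

```javascript
// ... continuing from the example above

// poll until the job leaves the "in_progress" state
let details = await client.getJobDetails(job.id);
while (details.status === 'in_progress') {
  await new Promise(resolve => setTimeout(resolve, 5000)); // wait 5 seconds between checks
  details = await client.getJobDetails(job.id);
}

// fetch the finished transcript as plain text
if (details.status === 'transcribed') {
  const transcriptText = await client.getTranscriptText(job.id);
  console.log(transcriptText);
}
```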
For existing pay-as-you-go (PAYG) users, the v2 ASR model automatically becomes the default on April 7, 2022; for enterprise users, on September 7, 2022. Once your account defaults to the v2 ASR model, it is no longer necessary to specify `transcriber: machine_v2` in API and SDK operations. The v1 ASR model and the related user preference will be deprecated on September 8, 2022.
Learn more about our Asynchronous Speech-to-Text API and transcription options (including a summary of the v1 to v2 migration roadmap).
Additional Notes
A few important points to note:
- The v2 model currently supports asynchronous mode and English language input only. Streaming support is coming soon and is currently running in a closed beta. Contact support@rev.ai if you would like to participate in the beta.
- Transcription pricing for the v2 model is the same as under the previous model. For more information on pricing, please contact sales@rev.ai.
- The confidence scores under the v2 model, although more accurate, may appear slightly lower than those calculated under the previous model. Any customer logic that depends on confidence scores should be re-validated accordingly (see the sketch after this list).
- The estimated turnaround time for v2 ASR is approximately 33%-40% shorter than for the previous model.
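If your integration flags or filters words by confidence, it is worth auditing an existing threshold against v2 output before relying on it. The sketch below continues from the Node SDK example above and assumes the standard JSON transcript layout returned by the asynchronous API (monologues containing word elements with a per-word `confidence` field); the threshold value itself is hypothetical.

```javascript
// ... continuing from the SDK example above

const CONFIDENCE_THRESHOLD = 0.85; // hypothetical threshold tuned against v1 output

// fetch the structured transcript and count words below the old threshold
const transcript = await client.getTranscriptObject(job.id);
const lowConfidenceWords = [];
for (const monologue of transcript.monologues) {
  for (const element of monologue.elements) {
    if (element.type === 'text' && element.confidence < CONFIDENCE_THRESHOLD) {
      lowConfidenceWords.push(element.value);
    }
  }
}
console.log(`${lowConfidenceWords.length} words fall below the existing threshold`);
```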
We’re excited about our new v2 ASR model and would love to hear your feedback and learn more about how you are using it. Let us know by emailing us at support@rev.ai.