The Podcast Challenge: Testing Rev.ai’s Speech Recognition Accuracy

We tested our Rev.ai Speech Recognition Technology's accuracy on a wide range of podcasts. Spoiler alert: we beat our competition.

Written by:

Miguel Jette

September 23, 2020

Table of contents

How Accurate is Rev.ai’s Automated Speech Recognition?

Building the Test Suite

Detailed overview of the test suite

List of Podcasts and Corresponding Episodes

Detailed Overview of the Results

Here on the Rev Speech R&D Team, we are constantly striving to improve Rev.ai’s Automated Speech Recognition (ASR) accuracy.

As such, we spend a lot of our time creating test suites for the many different scenarios where our customers use our speech recognition technology. One of those use cases? Podcasters who want to produce transcripts of their shows. In order to assess how well our ASR works for their particular needs, we collected a few of the most popular podcasts and used them as a test to determine how accurately Rev.ai performs.

In this blog, we will first present the results we obtained for these tests, then we will discuss the steps required to produce the test suite. Finally, we will examine the details of these particular podcasts to illustrate the difficulty of the test suite.

We hope this gives you some insight into how we think about our ASR system’s accuracy and showcases how we stack up against our competition.

How Accurate is Rev.ai’s Automated Speech Recognition?

First, some context and a disclaimer: we, the Rev Speech R&D team, use a proprietary toolkit to calculate Word Error Rate (WER). Fundamentally, the software calculates the same metric, but our methodology accounts for synonyms, typos, and number representations (e.g. “10” as “ten”), so that equivalent forms are not counted as errors. This lets us calculate the best possible WER for each provider. We hope to share this approach in a future article explaining the details of the technique.
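To give a rough feel for what this kind of normalization means, here is a minimal Python sketch. It is not our proprietary toolkit: the `EQUIVALENTS` table and the `normalize` function are hypothetical names invented for this illustration, and a real system would need far broader coverage of synonyms, typos, and numbers.

```python
import re

# Hypothetical, deliberately tiny equivalence table (an assumption for
# illustration only; not the mapping used by any real toolkit).
EQUIVALENTS = {
    "10": "ten",
    "ok": "okay",
}

def normalize(text):
    """Lowercase, drop punctuation, and map equivalent spellings to one form."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [EQUIVALENTS.get(token, token) for token in tokens]

reference = "We tested 10 podcasts, OK?"
hypothesis = "we tested ten podcasts okay"

# After normalization the two transcripts are token-for-token identical,
# so neither "10" vs "ten" nor "OK" vs "okay" would count as an error.
print(normalize(reference) == normalize(hypothesis))  # True
```

Applying the same normalization to both the reference and the ASR output before scoring is what keeps the comparison fair across providers.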

Also, as a reminder to the reader, WER is calculated using the following formula:

WER = (S + D + I) / N

where S, D, and I are the numbers of substitutions, deletions, and insertions found when aligning the ASR output against the reference transcript, and N is the total number of words in the reference transcript.
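For readers who want to experiment, the metric can be sketched in a few lines of Python using a standard word-level edit distance. This is a plain textbook version, not our proprietary toolkit; the function name `word_error_rate` is chosen for this example, and it assumes both transcripts are already tokenized by whitespace.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the minimum number of edits needed to turn the
    # reference words seen so far into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)

# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 100 percent when the output contains many more words than the reference.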

As a baseline, the graph below illustrates the accuracy results per speech recognition service, as of August 2020:

A bar chart comparing word error rates of various transcription services, including RevAI, Google, and Amazon.

Building the Test Suite

Now, let’s take a look at how we built our test suite.

First, as an absolute golden rule, one needs to ensure that the selected data was not used to train the ASR being tested. While randomly selecting podcasts to include in this test suite, our team carefully chose audio files that were not part of the training data for our ASR models, so as to prevent any unfair advantage.

Secondly, the amount of audio needs to be large enough for the error rate to be significant and meaningful for any analysis. That’s why our team carefully chose 30 episodes that amount to 27.5 hours of speech. We consider this to be significant enough to assess the accuracy of our models.

Finally, in order to test an ASR model properly, one should always consider as wide a range of acoustic conditions as possible, even within a given domain. This test suite covers a vast array of podcast genres, with many different speakers: storytelling with sound effects (Crimetown), group discussions with a lot of speaker overlap (The Read), and scripted news podcasts (The Daily).

In order to get accurate transcripts, we sent the 30 podcasts to the human-powered Rev.com service, choosing the verbatim option (to include as many words and repetitions as possible) and including a dictionary of important words to make sure proper names, like Kwame Kilpatrick, were properly transcribed.

Detailed overview of the test suite

Let’s take a look at the kind of data included in this test suite.

List of Podcasts and Corresponding Episodes

Below is the list of podcast episodes included in this test suite, each followed by its genre.

File 1: This American Life: Episode #661 | Society & Culture
File 2: The Read: Rebel Without a Cause | Comedy
File 3: The Read: Spice or Sour Cream? | Comedy
File 4: The Daily: The Plan to Discredit the Florida Recount | News
File 5: The Daily: The California Wildfires | News
File 6: The Moth Radio Hour: Hope and Glory | Art
File 7: The Moth Radio Hour: Deer Meat Dance Moves and Motherhood | Art
File 8: Podcasts in Color, Women Creating Podcast Networks ft. Ahyiana of @SPQPodcast | Society & Culture
File 9: Podcasts in Color, Creating Your Own Lane in Podcasting ft @Favyfav of @latinoswholunch | Society & Culture
File 10: Podcasts in Color, Podcast Tips From Berry | Society & Culture
File 11: Heavyweight: Episode #9 Milt | News
File 12: Heavyweight: Episode #10 Rose | News
File 13: Crimetown, Coming Soon: Season 2 | True Crime
File 14: Crimetown, Bonus Episode: Buddy Cianci…The Musical | True Crime
File 15: Crimetown, Chapter 18: The Prince of Providence | True Crime
File 16: Pod Save America, We won. | News
File 17: Pod Save America, The election is nigh! | News
File 18: The Daily Zeitgeist, The Christian Black Panther? | News
File 19: The New Yorker: The Writer’s Voice, Tommy Orange Reads The State | Art
File 20: Skidmarks Show, Episode 66 | Leisure
File 21: Food Psych, Episode #148 | Health
File 22: My Favorite Murder with Karen Kilgariff and Georgia Hardstark, Episode 145 | True Crime
File 23: Sorta Awesome, Episode 169 | Society & Culture
File 24: Drinkin’ Bros., Episode 338 | Society & Culture
File 25: Roads From Emmaus, What We Own is Sacred Because We Are Sacred (Oct. 14 2018) | Religion & Spirituality
File 26: The Bill Barnwell Show, Vince Verhei & Doug Kyed | Leisure
File 27: The Ross Bolen Podcast, Lemurs Are Important Pollinators | Society & Culture
File 28: Forked Up: A Thug Kitchen Podcast, All Casper Everything with Natalie Eva Marie | Health
File 29: American Fiasco, Bonus Episode with Stephen Dubner of Freakonomics Radio | Society & Culture
File 30: Recovery Elevator, Episode 195: What Should the Bottle Say? | Health

By Genres

Podcasts come in many genres. We selected a variety that covers most of the popular ones:

Genre	Number of podcasts
Society & Culture	8
News	7
True Crime	4
Art	3
Health	3
Comedy	2
Leisure	2
Religion & Spirituality	1

By Lengths

We chose enough podcasts to build a test suite of around 27.5 hours of audio. Most of the episodes are under 60 minutes long, but we also included some longer ones to test how our system behaves on longer files.

A histogram showing the distribution of podcast durations in minutes.

Figure 3: Distribution of the length of podcasts used in the test suite.

By Speakers

A key indicator of how difficult an audio file is for an ASR system is how many speakers are present. Again, we made sure the podcasts we chose vary widely in speaker count. Some of these episodes have only two speakers for the whole file, and some have as many as 35 (e.g. True Crime podcasts with many characters). Any third-party ads in a given podcast counted as new speakers as well.

Figure 4: Distribution of the number of speakers in the podcasts used in the test suite.

By SNRs

Another key indicator of how difficult an audio file is for an ASR system is its signal-to-noise ratio (SNR). Here we share the distribution of SNR levels for all podcasts included in the test suite.

The value shown is the average of the peak SNR measured per segment (in dB), where each segment is 1.92 seconds long.

About 6 podcasts have what we would consider slightly noisier acoustic environments (<30 dB); the rest are very good, studio-quality recordings.
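This post does not spell out how the per-segment peak SNR is estimated, so the Python sketch below is only one plausible reading, built on loudly stated assumptions: each 1.92-second segment is cut into short frames, the loudest frame stands in for the signal peak, the quietest frame stands in for the noise floor, and the per-segment ratios (in dB) are averaged. The function name, the 20 ms frame length, and the noise-floor heuristic are all choices made for this illustration, not our measurement code.

```python
import math

def average_peak_snr_db(samples, sample_rate, segment_seconds=1.92, frame_seconds=0.02):
    """Average, over fixed-length segments, of a loudest-to-quietest frame ratio in dB."""
    segment_len = round(segment_seconds * sample_rate)
    frame_len = round(frame_seconds * sample_rate)
    segment_snrs = []
    for start in range(0, len(samples) - segment_len + 1, segment_len):
        segment = samples[start:start + segment_len]
        # Mean energy of each short frame inside this segment.
        energies = []
        for offset in range(0, segment_len - frame_len + 1, frame_len):
            frame = segment[offset:offset + frame_len]
            energies.append(sum(x * x for x in frame) / frame_len)
        # Assumption: quietest frame approximates noise, loudest approximates peak signal.
        noise = max(min(energies), 1e-12)  # floor avoids division by zero on digital silence
        peak = max(max(energies), 1e-12)
        segment_snrs.append(10 * math.log10(peak / noise))
    if not segment_snrs:
        raise ValueError("audio is shorter than one segment")
    return sum(segment_snrs) / len(segment_snrs)

# Synthetic check: each segment is half "loud" (amplitude 1.0) and half "quiet" (0.1),
# an energy ratio of 100, which corresponds to roughly 20 dB.
rate = 1000
one_segment = [1.0] * 960 + [0.1] * 960
print(round(average_peak_snr_db(one_segment * 2, rate), 2))
```

A heuristic like this is sensitive to the frame length and to segments with no pauses, so treat its numbers as relative indicators rather than calibrated measurements.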

A histogram showing the distribution of the average peak SNR (in dB) across the podcasts.

Figure 5: Distribution of the average peak SNR for all podcasts in the test suite.

Detailed Overview of the Results

Below are the results organized by file. Against four competitors, Rev.ai has the lowest WER on 18 of the 30 files (60 percent). In second place is the Speechmatics v2 API, with 11 wins. Interestingly, Microsoft outperformed every other API on file 13.

File	Rev.ai	Speechmatics v2	Google Video	Microsoft	Amazon
1	5.85%	6.29%	7.72%	8.32%	8.22%
2	19.4%	18.82%	20.05%	20.32%	23.28%
3	21.99%	20.92%	21.74%	22.33%	26.3%
4	9.7%	7.54%	10.06%	10.35%	9.6%
5	8.61%	8.18%	8.75%	10.59%	10.4%
6	6.34%	7.39%	8.59%	9.56%	9.5%
7	4.17%	5.67%	6.46%	6.85%	7.01%
8	6.11%	9.26%	9.4%	10.15%	10.56%
9	12.12%	14.46%	14.54%	15.37%	16.35%
10	10.24%	11.36%	10.88%	12.01%	13.93%
11	14.46%	13.03%	13.86%	16.06%	16.48%
12	9.4%	9.61%	10.39%	11.66%	12.33%
13	19.44%	18.63%	16.71%	12.67%	19.83%
14	19.74%	16.5%	17.2%	19.66%	21.57%
15	18.39%	18.94%	20.67%	21.02%	23.03%
16	13.34%	14.14%	15.84%	16.52%	16.89%
17	11.46%	12.46%	13.92%	14.73%	15.04%
18	19.44%	20.09%	21.78%	22.55%	25.73%
19	5.66%	5.51%	6.14%	7.83%	7.72%
20	19.6%	20.47%	20.92%	21.18%	25.63%
21	4.96%	5.65%	6.74%	6.77%	7.29%
22	19.73%	18.31%	19.64%	20.37%	23.81%
23	8.69%	9.59%	10.53%	11.1%	12.39%
24	21.71%	23.47%	26.25%	26.36%	28.68%
25	4.71%	3.92%	4.19%	5.43%	5.98%
26	14.85%	17.28%	19.27%	18.5%	21.61%
27	10.43%	11.97%	11.82%	13.05%	14.04%
28	19.88%	19.38%	20.87%	22.04%	25.53%
29	13.91%	13.65%	15.48%	16.62%	17.96%
30	10.08%	12.79%	12.81%	14.38%	17.0%
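The win counts quoted above can be re-derived from the table with a short Python script. The numbers are copied from the per-file results; the variable names are just choices for this sketch, and a "win" is taken to mean the lowest WER on a file.

```python
from collections import Counter

PROVIDERS = ["Rev.ai", "Speechmatics v2", "Google Video", "Microsoft", "Amazon"]

# Per-file WER in percent, files 1 through 30, columns in PROVIDERS order.
WER_TABLE = [
    [5.85, 6.29, 7.72, 8.32, 8.22],
    [19.4, 18.82, 20.05, 20.32, 23.28],
    [21.99, 20.92, 21.74, 22.33, 26.3],
    [9.7, 7.54, 10.06, 10.35, 9.6],
    [8.61, 8.18, 8.75, 10.59, 10.4],
    [6.34, 7.39, 8.59, 9.56, 9.5],
    [4.17, 5.67, 6.46, 6.85, 7.01],
    [6.11, 9.26, 9.4, 10.15, 10.56],
    [12.12, 14.46, 14.54, 15.37, 16.35],
    [10.24, 11.36, 10.88, 12.01, 13.93],
    [14.46, 13.03, 13.86, 16.06, 16.48],
    [9.4, 9.61, 10.39, 11.66, 12.33],
    [19.44, 18.63, 16.71, 12.67, 19.83],
    [19.74, 16.5, 17.2, 19.66, 21.57],
    [18.39, 18.94, 20.67, 21.02, 23.03],
    [13.34, 14.14, 15.84, 16.52, 16.89],
    [11.46, 12.46, 13.92, 14.73, 15.04],
    [19.44, 20.09, 21.78, 22.55, 25.73],
    [5.66, 5.51, 6.14, 7.83, 7.72],
    [19.6, 20.47, 20.92, 21.18, 25.63],
    [4.96, 5.65, 6.74, 6.77, 7.29],
    [19.73, 18.31, 19.64, 20.37, 23.81],
    [8.69, 9.59, 10.53, 11.1, 12.39],
    [21.71, 23.47, 26.25, 26.36, 28.68],
    [4.71, 3.92, 4.19, 5.43, 5.98],
    [14.85, 17.28, 19.27, 18.5, 21.61],
    [10.43, 11.97, 11.82, 13.05, 14.04],
    [19.88, 19.38, 20.87, 22.04, 25.53],
    [13.91, 13.65, 15.48, 16.62, 17.96],
    [10.08, 12.79, 12.81, 14.38, 17.0],
]

# Count, for each provider, the files on which it has the lowest WER.
wins = Counter()
for row in WER_TABLE:
    best_index = min(range(len(row)), key=row.__getitem__)
    wins[PROVIDERS[best_index]] += 1

for provider, count in wins.most_common():
    print(f"{provider}: {count}")
# Rev.ai: 18
# Speechmatics v2: 11
# Microsoft: 1
```

Counting wins treats every file equally regardless of its length, so it complements rather than replaces an overall WER figure.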