Podcast Transcription Benchmark (Part 1)
Transcription is a key tool for podcast editing and engagement. Starting with an accurate ASR transcript greatly speeds up the process. See how Rev AI performs against Google and Speechmatics.
All of our customers want to know: which ASR service is the most accurate?
Typically, customers test ASR services from large tech companies (e.g., Google) and speech recognition-focused companies (e.g., Speechmatics). We decided to test the Word Error Rate (WER) of Google’s video model, Speechmatics, and Rev AI. Below is the benchmark for podcasts; we will be publishing benchmarks on other types of content soon.
Podcasts are popular and a natural use case for transcription. Transcription helps creators during the editing process and helps publishers drive SEO. For both, the accuracy of the transcript makes a big difference.
In Part 1, we will take you through the methodology, the data set, and the resulting WER. In Part 2, we will go into more detail on where the different ASRs do better and worse, with a word-by-word comparison of the transcripts.
Defining Word Error Rate:
There are multiple ways to measure the quality of an ASR service (e.g., the sclite scoring tool). We have developed a more robust internal testing methodology that takes synonyms, typos, and number formats (e.g., “10” as “ten”) into consideration. However, our method is largely derived from the standard Levenshtein distance formula for word error.
The formula is as follows: WER = (S + D + I) / N. (A code sketch of this calculation follows the definitions below.)
- S is the number of substitutions
- E.g. Ref: “I went to the store” vs Hyp: “I went to the shore”
- D is the number of deletions
- E.g. Ref: “I went to the store” vs Hyp: “I went to store”
- I is the number of insertions
- E.g. Ref: “I went to the store” vs Hyp: “I went to the party store”
- N is the number of words in the reference
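To make the formula concrete, here is a minimal Python sketch of the standard word-level Levenshtein calculation described above. It only does a simple lowercase-and-strip-punctuation normalization; as noted, our internal scoring additionally handles synonyms, typos, and number formats, which this sketch deliberately omits.

```python
import re

def normalize(text):
    # Lowercase, strip punctuation, split into words. Our internal
    # scoring also handles synonyms, typos, and number formats
    # ("10" vs. "ten"); this sketch omits those.
    return re.sub(r"[^\w\s']", "", text.lower()).split()

def wer(reference, hypothesis):
    # Word-level Levenshtein distance: d[i][j] is the minimum number
    # of substitutions, deletions, and insertions needed to turn the
    # first i reference words into the first j hypothesis words.
    ref, hyp = normalize(reference), normalize(hypothesis)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)  # (S + D + I) / N

# One insertion against a five-word reference: WER = 1/5 = 20%
print(wer("I went to the store", "I went to the party store"))
```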
Steps taken:
We selected 30 representative episodes from popular podcasts like “The Daily”, “My Favorite Murder”, and “Pod Save America” as our test suite. Here are the steps we took to generate the Rev AI, Google, and Speechmatics WER for each file:
- Create the reference transcript (we used our human-generated verbatim transcript from Rev.com)
- Run each audio through Rev AI, Google’s video model, and Speechmatics to get the ASR transcripts
- Compare each word of the ASR transcripts to the reference transcript and calculate the Word Error Rate (WER); a sketch of this step appears after the list
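As a rough illustration of the comparison step, the sketch below scores each episode’s ASR transcripts against its reference and averages the results per service. It assumes a hypothetical directory layout with plain-text transcripts (the folder and file names are placeholders, not our actual pipeline) and reuses the wer() function from the sketch above.

```python
from pathlib import Path

# Hypothetical layout: podcast_benchmark/<episode>/reference.txt plus
# one <service>.txt transcript per ASR service; names are placeholders.
SERVICES = ["rev_ai", "google_video", "speechmatics"]
DATA_DIR = Path("podcast_benchmark")

results = {svc: [] for svc in SERVICES}
for ref_path in sorted(DATA_DIR.glob("*/reference.txt")):
    reference = ref_path.read_text()
    for svc in SERVICES:
        hypothesis = (ref_path.parent / f"{svc}.txt").read_text()
        results[svc].append(wer(reference, hypothesis))  # wer() from above

for svc, scores in results.items():
    avg = sum(scores) / len(scores)
    print(f"{svc}: average WER {avg:.1%} across {len(scores)} episodes")
```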
If you’d like any more details, please contact us.
The results:
Below is the average WER, and a table comparing WER by file. See the Airtable and Google Drive folder below for the full data set, including the podcast MP3s, reference transcripts, and transcripts from each ASR service (you should be able to access all the details and/or make a copy).
Average WER:
- Rev AI: 16.6%
- Google Video model: 18.0%
- Speechmatics: 20.6%
See the table below for the WER of each file. Note: the green-highlighted cell in each row marks the lowest WER.
The table above shows:
- Rev AI’s WER is the lowest in 18 out of the 30 podcasts
- Google video model’s WER is lowest in the remaining 12
- There is no file where Speechmatics has the lowest WER
In the next post, we will get into why the different services do better or worse on the various audio files.
Download the full data:
You can also download all the files at once from this Google Drive folder.
Some considerations:
- WER is just one way to measure quality; specifically, it only looks at the accuracy of the words. It does not take punctuation or speaker diarization (knowing who said what) into account. We will do a more thorough comparison of the full transcripts in Part 2.
- WER also weighs all errors equally, even though getting nouns and industry terminology correct is much more important than getting “umm” and “ah” right. Adding custom vocabulary dramatically improves accuracy on important terms, something we will talk about later; a quick example follows this list.
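As a preview of the custom vocabulary discussion, here is a rough sketch of submitting a job with custom vocabulary phrases to the Rev AI speech-to-text API. The token, media URL, and phrases below are placeholders, and the exact request shape should be checked against the current API documentation.

```python
import requests

ACCESS_TOKEN = "YOUR_REV_AI_ACCESS_TOKEN"  # placeholder token
job = requests.post(
    "https://api.rev.ai/speech-to-text/v1/jobs",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json={
        "media_url": "https://example.com/episode.mp3",  # placeholder URL
        # Terms likely to be missed without help, e.g. show names and
        # industry vocabulary; the phrases here are illustrative only.
        "custom_vocabularies": [
            {"phrases": ["Speechmatics", "diarization", "Pod Save America"]}
        ],
    },
)
print(job.json())  # job id and status, for polling until the transcript is ready
```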
Stay tuned for part 2, where we will dig into why we see differences in the ASR output – specifically, the types of errors, where Google and Rev AI perform better, and the causes of abnormally high WER on a couple of the test files.
We would love to hear your comments and feedback, and what else you would like to see. Feel free to post below or drop us an email.