Podcast Transcription Benchmark (Part 1)

Transcription is a key tool for podcast editing and engagement. Starting with an accurate ASR greatly speeds up the process. See how Rev AI performs against Google and Speechmatics.

Written by:
Abid Mohsin
February 7, 2019

All of our customers want to know: which ASR service is the most accurate?

Typically, customers test ASR services from large tech companies (e.g., Google) and speech recognition-focused companies (e.g., Speechmatics). We decided to test the Word Error Rate (WER) of Google’s video model, Speechmatics, and Rev AI. Below is the benchmark for podcasts, and we will be publishing benchmarks on other types of content soon.

Podcasts are popular and a good use case for transcription. Transcription helps creators in the editing process and helps publishers drive SEO. For both, the accuracy of the transcript makes a big difference.

In Part 1, we will take you through the methodology, the data set, and the resulting WER. In Part 2, we will go into more detail on where the different ASRs do better and worse, and provide a word-by-word comparison of the transcripts.

Defining Word Error Rate:

There are multiple ways in which you can measure the quality of an ASR service, e.g., sclite. We have developed a more robust internal testing methodology that takes into consideration synonyms, typos, and numbers (e.g., “10” as “ten”). However, our method is largely derived from the standard Levenshtein distance formula for word error.
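To make that concrete, here is a toy sketch of the kind of normalization described: lowercasing, stripping punctuation, and mapping a few tokens before scoring. The mappings below are illustrative placeholders, not our internal rules.

```python
import re

# Toy normalization before scoring -- a sketch, not our production pipeline.
# The mappings below are made-up examples, not our actual lists.
NUMBER_WORDS = {"10": "ten", "2": "two"}
SYNONYMS = {"okay": "ok"}

def normalize(text):
    """Lowercase, strip punctuation, and apply simple token mappings."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    tokens = [NUMBER_WORDS.get(t, t) for t in tokens]
    tokens = [SYNONYMS.get(t, t) for t in tokens]
    return tokens

print(normalize("Okay, I bought 10 apples."))  # ['ok', 'i', 'bought', 'ten', 'apples']
```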

The formula is as follows: WER = (S + D + I) / N.

  • S is the number of substitutions
    • E.g. Ref: “I went to the store” vs Hyp: “I went to the shore”
  • D is the number of deletions
    • E.g. Ref: “I went to the store” vs Hyp: “I went to store”
  • I is the number of insertions
    • E.g. Ref: “I went to the store” vs Hyp: “I went to the party store”
  • N is the number of words in the reference
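
If you want to reproduce the arithmetic, below is a minimal word-level Levenshtein implementation of the formula. It is a simplified sketch without the synonym, typo, and number handling mentioned above, not our internal scorer.

```python
def word_error_rate(reference, hypothesis):
    """Compute WER = (S + D + I) / N using word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# One insertion ("party") against a 5-word reference: WER = 1/5 = 0.2
print(word_error_rate("I went to the store", "I went to the party store"))
```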

Steps taken:

We selected 30 representative episodes from some of the most popular podcasts, like “The Daily”, “My Favorite Murder”, and “Pod Save America”, as our test suite. Here are the steps we took to generate Rev AI, Google, and Speechmatics’ WER for each file:

  1. Create the reference transcript (we used our human-generated verbatim transcript from Rev.com)
  2. Run each audio file through Rev AI, Google’s video model, and Speechmatics to get the ASR transcripts (a rough sketch of this step follows this list)
  3. Compare each word of the ASR transcripts to the reference transcripts and calculate the Word Error Rate (WER)
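
To make step 2 concrete, here is a minimal sketch of submitting one audio URL to Rev AI’s asynchronous speech-to-text API and fetching the plain-text transcript when the job finishes. The endpoint, the media_url field, and the status values reflect our reading of the public v1 API; treat them as assumptions and check the current API documentation before copying.

```python
import time
import requests

# Assumed v1 endpoint and fields; verify against the current Rev AI API docs.
API = "https://api.rev.ai/speechtotext/v1"
HEADERS = {"Authorization": "Bearer <YOUR_ACCESS_TOKEN>"}

def transcribe(media_url):
    """Submit an audio URL and poll until the transcript is ready (sketch)."""
    job = requests.post(f"{API}/jobs", headers=HEADERS,
                        json={"media_url": media_url}).json()
    while True:
        status = requests.get(f"{API}/jobs/{job['id']}", headers=HEADERS).json()
        if status["status"] in ("transcribed", "failed"):
            break
        time.sleep(30)  # podcast-length audio can take a while
    return requests.get(f"{API}/jobs/{job['id']}/transcript",
                        headers={**HEADERS, "Accept": "text/plain"}).text
```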

If you’d like any more details, please contact us.

The results:

Below is the average WER, and a table comparing WER by file. See the Airtable and Google Drive folder below for the full data set, including the podcast MP3s, reference transcripts, and transcripts from each ASR service (you should be able to access all the details and/or make a copy).

Average WER:

  • Rev AI: 16.6%
  • Google video model: 18.0%
  • Speechmatics: 20.6%

See the table below for the WER for each file. Note: the green-highlighted cell marks the lowest WER for each file.

The table above shows:

  • Rev AI’s WER is the lowest in 18 out of the 30 podcasts
  • Google video model’s WER is lowest in the remaining 12
  • There is no file where Speechmatics has the lowest WER

In the next post, we will try to get into why the different services do better or worse on the various audio files.

Download the full data:

You can also download all the files at once from this Google Drive folder.

Some considerations:

  1. WER is just one way to measure quality; specifically, it only looks at the accuracy of the words. It does not take into account punctuation or speaker diarization (knowing who said what). We will do a more thorough comparison of the full transcripts in Part 2.
  2. WER does weigh all errors equally. Getting nouns and industry terminology correct is much more important than getting “umm” and “ah” right. Adding custom vocabulary dramatically improves accuracy on important terms, something we will talk about more later (see the sketch after this list).
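
As a hedged illustration of the custom vocabulary point above, a job submission can carry a list of domain phrases so the ASR is biased toward them. The custom_vocabularies field shown here is how we understand the Rev AI v1 jobs endpoint accepts them; the phrases themselves are made-up examples, and you should confirm the field names in the current API docs.

```python
import requests

API = "https://api.rev.ai/speechtotext/v1"
HEADERS = {"Authorization": "Bearer <YOUR_ACCESS_TOKEN>"}

# Sketch: supply domain terms with the job submission (assumed field names).
payload = {
    "media_url": "https://example.com/podcast-episode.mp3",
    "custom_vocabularies": [
        {"phrases": ["Pod Save America", "Speechmatics", "diarization"]}
    ],
}
job = requests.post(f"{API}/jobs", headers=HEADERS, json=payload).json()
print(job["id"], job["status"])
```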

Stay tuned for part 2, where we will dig into why we see differences in the ASR output – specifically, the types of errors, where Google and Rev AI perform better, and the causes of abnormally high WER on a couple of the test files.

We would love to get your comments and feedback, and to hear what else you would like to see. Feel free to post below or drop us an email.
