Reducing Evaluation Bias in Speech Recognition
Discover how Rev's new multitranscript method reduces bias in speech recognition evaluation, revealing more accurate measurements of AI transcription quality.

Rev's commitment to accuracy in speech-to-text technology has led to a groundbreaking discovery in how we evaluate AI transcription models. Our latest research reveals that traditional evaluation methods might be missing the mark – and the real accuracy rates could be better than we thought.
We’re constantly evaluating our internal Reverb model, open source models like Whisper and Canary, and other companies’ speech-to-text systems to understand where we stand and where the community needs to go to make our goal a reality. A big part of this work is ensuring evaluations are as fair and unbiased as possible – in this project, we’ve identified a new evaluation method that can reduce transcription style bias.
A Tale of Two Transcription Styles
At Rev, we provide two styles of transcripts:
- Verbatim transcripts are created by transcriptionists who write exactly what they hear, including filler words, stutters, interjections (active listening), and repetitions.
- Non-verbatim transcripts are created by lightly editing for readability without changing the structure or meaning of the speech. These are sometimes called clean transcriptions.
If you submit the same audio through both pipelines, you’ll find that these choices can produce very different transcripts. Neither transcript is wrong; they are simply different in style, as the sketch below illustrates.
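To make the effect concrete, here is a minimal sketch (not Rev’s evaluation code) of how transcript style alone can shift word error rate (WER). It assumes the open source jiwer package, and the example sentences and hypothesis are invented for illustration.

```python
# A minimal sketch showing how transcript style alone shifts WER.
# Assumes the open-source `jiwer` package for WER scoring.
import jiwer

# The same utterance, transcribed in both styles.
verbatim_ref    = "um so i i think the the results are uh really promising"
nonverbatim_ref = "so i think the results are really promising"

# A hypothetical ASR hypothesis that skips fillers and repetitions.
hypothesis = "so i think the results are really promising"

print("WER vs verbatim reference:    ", jiwer.wer(verbatim_ref, hypothesis))
print("WER vs non-verbatim reference:", jiwer.wer(nonverbatim_ref, hypothesis))
# The hypothesis matches the non-verbatim reference exactly (WER 0.0),
# yet is charged several deletions against the verbatim one -- a purely
# stylistic penalty, not a real recognition error.
```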
Groundbreaking Results From Real-World Testing
To demonstrate this evaluation bias, we expanded two existing open source datasets: Rev16 (podcasts) and Earnings22 (earnings calls from global companies), both of which ship with verbatim transcripts. We produced the corresponding non-verbatim transcripts and compared the word error rate (WER) of our internal model and OpenAI’s Whisper API. As you can see, our Verbatim API does better on the verbatim-style transcripts, our Non-Verbatim API does better on the non-verbatim style, and Whisper falls somewhere in between.
[Charts: WER on Rev16 and WER on Earnings22 Subset10, by transcript style]
The Multitranscript Solution: A New Era of Evaluation
In this release, we provide code to produce fused transcripts we call “multitranscripts,” which allow our evaluation to be more flexible across different stylistic choices. When we use these multitranscripts instead of the single-style transcripts, we find a sizable difference in performance.
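Our actual multitranscript fusion is described in the code release and the paper linked below; as a rough illustration of style-flexible scoring, the sketch here simply scores a hypothesis against every available reference style and keeps the lowest WER, so a system is not penalized for a legitimate stylistic choice. It again assumes the jiwer package, and the function and variable names are illustrative only.

```python
# A simple stand-in for style-flexible evaluation (not Rev's multitranscript code):
# score against every reference style and keep the best (lowest) WER.
import jiwer

def best_reference_wer(references: list[str], hypothesis: str) -> float:
    """Return the lowest WER of the hypothesis against any of the references."""
    return min(jiwer.wer(ref, hypothesis) for ref in references)

references = [
    "um so i i think the the results are uh really promising",  # verbatim style
    "so i think the results are really promising",               # non-verbatim style
]

# Scores 0.0: the hypothesis is a legal non-verbatim rendering of the audio.
print(best_reference_wer(references, "so i think the results are really promising"))
# Only genuine mismatches (here, the missing word "really") still count as errors.
print(best_reference_wer(references, "um so i think the results are promising"))
```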
Given that all three APIs see a big improvement, the rate of real errors (as opposed to stylistic differences) appears to be much lower than previously thought!
[Charts: WER on Rev16 and WER on Earnings22 Subset10, scored against multitranscripts]
Surprisingly, our initial evaluation showed that Rev’s API was better than the OpenAI API by about 20% on average, but the new method shows that OpenAI surpasses us by about 15% on Earnings22 Subset10 and remains only slightly behind on Rev16! We’ve only just scratched the surface of this technique and are excited to continue exploring how to improve our evaluations.
Want to dive deeper into the technical details? Check out our full research paper on arXiv.