Reducing Evaluation Bias in Speech Recognition
Discover how Rev's new multitranscript method reduces bias in speech recognition evaluation, revealing more accurate measurements of AI transcription quality.

Rev's commitment to accuracy in speech-to-text technology has led to a groundbreaking discovery in how we evaluate AI transcription models. Our latest research reveals that traditional evaluation methods might be missing the mark – and the real accuracy rates could be better than we thought.
We’re constantly evaluating our internal Reverb model, open source models like Whisper and Canary, and other companies’ speech-to-text systems to understand where we stand and where the community needs to go to make our goal a reality. A big part of this work is ensuring evaluations are as fair and unbiased as possible – in this project, we’ve identified a new evaluation method that can reduce transcription style bias.
A Tale of Two Transcription Styles
At Rev, we provide two styles of transcripts:
- Verbatim transcripts are created by transcriptionists who write exactly what they hear, including filler words, stutters, interjections (active listening), and repetitions.
- Non-verbatim transcripts are created by lightly editing for readability without changing the structure or meaning of the speech. These are sometimes called clean transcriptions.
If you submit the same audio through both pipelines, you’ll find that these choices can produce very different transcripts. Neither transcript is wrong; they are simply different in style, as the sketch below illustrates.
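To make the effect concrete, here is a minimal sketch (not Rev’s evaluation code) of how transcript style alone can shift word error rate (WER). It assumes the open source jiwer package, and the example sentences and hypothesis are invented for illustration.

```python
# A minimal sketch showing how transcript style alone shifts WER.
# Assumes the open-source `jiwer` package for WER scoring.
import jiwer

# The same utterance, transcribed in both styles.
verbatim_ref    = "um so i i think the the results are uh really promising"
nonverbatim_ref = "so i think the results are really promising"

# A hypothetical ASR hypothesis that skips fillers and repetitions.
hypothesis = "so i think the results are really promising"

print("WER vs verbatim reference:    ", jiwer.wer(verbatim_ref, hypothesis))
print("WER vs non-verbatim reference:", jiwer.wer(nonverbatim_ref, hypothesis))
# The hypothesis matches the non-verbatim reference exactly (WER 0.0),
# yet is charged several deletions against the verbatim one -- a purely
# stylistic penalty, not a real recognition error.
```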
Groundbreaking Results From Real-World Testing
To demonstrate this evaluation bias, we expanded two existing open source datasets: Rev16 (podcasts) and Earnings22 (earnings calls from global companies), both of which ship with verbatim transcripts. We produced the corresponding non-verbatim transcripts and compared the word error rate (WER) of our internal model and OpenAI’s Whisper API. As you can see, our Verbatim API does better on the verbatim-style transcripts, our Non-Verbatim API does better on the non-verbatim style, and Whisper falls somewhere in between.
[Charts: WER on Rev16 and WER on Earnings22 Subset10, by transcript style]
The Multitranscript Solution: A New Era of Evaluation
In this release, we provide code to produce fused transcripts we call “multitranscripts,” which allow our evaluation to be more flexible across different stylistic choices. When we use these multitranscripts instead of the single-style transcripts, we find a sizable difference in performance.
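Our actual multitranscript fusion is described in the code release and the paper linked below; as a rough illustration of style-flexible scoring, the sketch here simply scores a hypothesis against every available reference style and keeps the lowest WER, so a system is not penalized for a legitimate stylistic choice. It again assumes the jiwer package, and the function and variable names are illustrative only.

```python
# A simple stand-in for style-flexible evaluation (not Rev's multitranscript code):
# score against every reference style and keep the best (lowest) WER.
import jiwer

def best_reference_wer(references: list[str], hypothesis: str) -> float:
    """Return the lowest WER of the hypothesis against any of the references."""
    return min(jiwer.wer(ref, hypothesis) for ref in references)

references = [
    "um so i i think the the results are uh really promising",  # verbatim style
    "so i think the results are really promising",               # non-verbatim style
]

# Scores 0.0: the hypothesis is a legal non-verbatim rendering of the audio.
print(best_reference_wer(references, "so i think the results are really promising"))
# Only genuine mismatches (here, the missing word "really") still count as errors.
print(best_reference_wer(references, "um so i think the results are promising"))
```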
Given that all three APIs see a big improvement, the rate of real errors (as opposed to stylistic differences) appears to be much lower than previously thought!
[Charts: WER on Rev16 and WER on Earnings22 Subset10, scored against multitranscripts]
Surprisingly, our initial evaluation showed that Rev’s API was better than the OpenAI API by about 20% on average, but the new method shows that OpenAI surpasses us by about 15% on Earnings22 Subset10 and remains only slightly behind on Rev16! We’ve only just scratched the surface of this technique and are excited to continue exploring how to improve our evaluations.
Want to dive deeper into the technical details? Check out our full research paper on arXiv.