How to Spot Deepfake Audio: 3 Tips for Detecting AI-Generated Speech

Though AI is getting good at replicating human speech, a few signs can still help you tell a synthetic voice from a real one. Here's how to spot deepfake audio.

Written by:

Miguel Jette

May 5, 2024

Table of contents

1. Flat Speaking Tone

2. Slurred, Unnatural Speech

3. Odd Background Noises

Can you tell the difference between a human voice and an AI voice?

It’s not as easy as you might suspect. In 2023, researchers at University College London found that AI-generated speech was realistic enough to fool listeners almost 25% of the time, and the technology has only improved since then. It now takes “about a minute to two minutes of a person’s voice” to create a convincing but fake AI voice using tools widely available online, according to computer science professor Hany Farid.

As generative AI continues to improve, synthetic audio will sound more and more like we do. Advanced text-to-speech and speech-to-speech capabilities are a promising development with widespread potential for our work and lives: creators like podcasters can easily fix flubbed lines in post-production, and people with ALS are recreating their lost voices. But these capabilities have also introduced new concerns about how AI audio could be used maliciously.

Deepfake audio has dominated headlines as the engine behind voice cloning and robocall scams, and it has stoked fears of campaign misinformation during an election year. Meanwhile, researchers and startups are developing methods to detect deepfake audio, using techniques such as sensing acoustic signals, analyzing metadata and artifacts, and searching for missing frequencies with machine learning models.
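To make the "missing frequencies" idea a little more concrete, here is a minimal Python sketch. It is not a machine learning model, and it is not how any particular detector works. It only measures what share of a clip's energy sits above a cutoff frequency, using the Goertzel algorithm on synthetic test tones. The function names, the 4 kHz cutoff, and the 250 Hz probe spacing are all assumptions chosen for illustration.

```python
import math

def band_energy(samples, sample_rate, freq_hz):
    """Goertzel algorithm: signal power at the DFT bin nearest freq_hz."""
    n = len(samples)
    k = round(freq_hz * n / sample_rate)  # nearest DFT bin index
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2

def high_band_share(samples, sample_rate, cutoff_hz=4000, step_hz=250):
    """Fraction of the probed energy found at or above cutoff_hz (hypothetical metric)."""
    probes = range(step_hz, sample_rate // 2, step_hz)
    energies = {f: band_energy(samples, sample_rate, f) for f in probes}
    total = sum(energies.values())
    high = sum(e for f, e in energies.items() if f >= cutoff_hz)
    return high / total if total else 0.0

def tone(freq_hz, sample_rate, n):
    """Synthetic sine wave used as stand-in audio."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

RATE, N = 16000, 1600
low = tone(500, RATE, N)                                        # low-frequency content only
full = [a + 0.5 * b for a, b in zip(low, tone(6000, RATE, N))]  # adds a 6 kHz component

print(high_band_share(low, RATE))
print(high_band_share(full, RATE))
```

On these toy signals, the low-only tone reports a share near zero, while the mixed tone reports a clearly larger one. Real speech is far messier, so a number like this would be at most one input among many to a trained model.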

Even though AI is getting remarkably good at replicating human speech, a few telltale signs can still help you tell a synthetic voice from a real one. Here are three ways to recognize whether you’re listening to a deepfake voice or a real person.

1. Flat Speaking Tone

Emotion and sentiment are especially difficult to get right in AI-generated audio. When people talk to each other, they express their opinions and feelings through tonal shifts, emotional signals, and countless small but significant changes to their speech. If a voice sounds awkwardly flat and dry, or if the delivery doesn’t match the words (a sentence ending with an upward lilt that implies a question that isn’t there, for example), that’s a potential sign of a deepfake AI voice.
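For readers who want to put a rough number on "flat," the sketch below estimates pitch frame by frame and reports how much it varies. It assumes that pitch variation is a usable stand-in for tonal shifts, which is a simplification. It also uses a basic autocorrelation pitch estimate, which can be off by an octave on real voices. The function names, frame length, and pitch range are hypothetical choices, and the inputs are synthetic tones rather than recorded speech. Treat it as an illustration, not a deepfake detector.

```python
import math
import statistics

def estimate_pitch(frame, sample_rate, fmin=100, fmax=400):
    """Pitch (Hz) of one frame from the autocorrelation peak within [fmin, fmax]."""
    min_lag, max_lag = sample_rate // fmax, sample_rate // fmin
    window = len(frame) - max_lag  # same number of terms for every lag
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(window))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0

def pitch_variability(samples, sample_rate, frame_len=480):
    """Standard deviation (Hz) of the per-frame pitch estimates."""
    pitches = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        pitch = estimate_pitch(samples[start:start + frame_len], sample_rate)
        if pitch > 0:  # skip frames where no pitch was found
            pitches.append(pitch)
    return statistics.pstdev(pitches) if len(pitches) > 1 else 0.0

RATE = 8000
times = [i / RATE for i in range(RATE)]  # one second of audio

# A monotone 160 Hz "voice" and one whose pitch glides from 120 Hz up to 240 Hz.
flat = [math.sin(2 * math.pi * 160 * t) for t in times]
glide = [math.sin(2 * math.pi * (120 * t + 60 * t * t)) for t in times]

print(pitch_variability(flat, RATE))
print(pitch_variability(glide, RATE))
```

The monotone signal shows essentially no spread, while the gliding one shows a much larger spread. A low value on its own proves nothing, since plenty of real speakers are monotone; it is only one cue to weigh alongside the others.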

2. Slurred, Unnatural Speech

Deepfake audio is created by training a speech synthesis model on sample recordings of an actual person’s speech. It’s like an extremely complex form of pattern matching: The more samples you use, the more closely an AI voice will resemble the person it’s meant to mimic (yes, AI is that efficient).

But this also means that AI voices can struggle with unusual or unique words that don’t appear in the samples. Slurred speech, mispronounced words, and awkward stumbling over phrases may suggest that you’re listening to deepfake audio.

3. Odd Background Noises

It’s never been easier to record clean and crisp audio, even if you’re recording from your phone. If you notice a lot of atypical background noise, like static or crackling noises, that’s another clue that it might be deepfake audio — especially if the speaker is somebody who would typically use professional recording equipment, like a famous creator or celebrity.

These tips can help spot deepfakes, but remember: Even the best AI detection software can fail to tell the difference between real speech and synthetic audio. That’s why the Better Business Bureau recommends following a basic rule: “If it sounds too good to be true, it probably is.”
