Uncovering the Real Voice: How to Detect and Verify Audio Deepfakes (2024)

Deepfakes are a type of synthetic media that use deep learning techniques to create realistic but entirely fabricated content. An audio deepfake is part of the deepfake family, utilising Artificial Intelligence (AI) and Machine Learning (ML) techniques to manipulate or generate audio recordings that appear to be authentic but are synthetic or altered. The synthetic audio clips can mimic the voice, tone, and speech patterns of a specific individual, making them sound genuine. Two main techniques are commonly used to create audio deepfakes:

  1. Voice Conversion (VC): Real-time dubbing over the original voice using the target’s artificial voice.
  2. Text-to-Speech (TTS): Creation of synthetic speech by using text as input and generating an output speech audio clip using the target’s artificial voice.

With an increased focus on deepfake research, audio deepfakes have achieved an unprecedented level of realism. This is a stark contrast to earlier versions of deepfakes, which often sounded robotic or had an unnatural quality. Today, commercial tools to create audio deepfakes are available for as little as $1 per month, enabling the generation of highly convincing and lifelike audio replicas. Below are some fascinating examples that our team has produced using these cutting-edge audio deepfake tools.

Audio deepfake technologies can even reproduce fake covers of songs by original artists. Here’s an example of Stephanie Sun singing a cover of 安静 (An Jing) by Jay Chou, created using SoftVC VITS!

For a fun experiment, you can recreate such videos yourself by using this simple Google Colab notebook we have created. Please note that this is for demonstration and learning purposes only; we do not condone the cloning of voices using AI in ways that may infringe any intellectual property rights.

In this article, we will delve into how audio deepfakes have impacted our lives; discuss the various techniques & technologies to detect audio deepfakes; and share how we (at HTX S&S CoE) are tackling the challenges posed by audio deepfakes.

As with any new discovery or technology, audio deepfakes can be used to do good or bad. Audio deepfakes have also improved lives. For example, they can help visually impaired individuals consume text materials or allow patients diagnosed with throat cancer or Amyotrophic Lateral Sclerosis (ALS) to speak again with their “own voice”¹. Val Kilmer, an actor who successfully battled throat cancer, personally experienced the benefits of audio deepfake technology. He was gratified to be able to lend his voice once again to the big screen in Top Gun: Maverick and commented that “the chance to narrate my story, in a voice that feels authentic and familiar, is an incredibly special gift”. In the entertainment industry, audio deepfakes have also been used to preserve the voices of retired or deceased actors with tools like ReSpeecher².

On the flip side, scammers and bad actors have used audio deepfakes for nefarious purposes. Scammers, for example, have been exploiting audio deepfake technologies to increase the sophistication and believability of scam calls, misleading people with fake voices of their loved ones in distress for extortion or financial fraud schemes. Donna Letto, the victim of a scam call that used a deepfake of her son’s voice³, shared with CBC News that even though her son’s voice sounded somewhat different, she still went ahead and sent money to supposedly bail him out.

Audio deepfakes have also been used to spread misinformation, orchestrate targeted attacks, and manipulate political agendas. For instance, deepfake videos of the Ukrainian President announcing that Ukraine would surrender were uploaded to multiple social media platforms and even inserted into a hacked live television broadcast⁴.

With such harmful consequences associated with the misuse of audio deepfakes, it is imperative that we all remain vigilant. At HTX S&S CoE, we have embarked on combating audio deepfakes to help ensure public safety.

As the field of audio deepfake detection research is relatively new, methods for detecting these deepfakes are still evolving. Detection methods generally fall into three categories:

A. Hand-crafted perceptual features: This method compares statistical differences between synthetic and human speech. Using advanced signal processing algorithms, comparisons are made across variations in normalised amplitude range, pitch deviation, cadence, phonetic transitions and pauses between utterances. Results from this detection method can be easily interpreted by observing differences between the pitch pattern of a synthetic sample and that of natural human speech. In the two images presented below, the locus where the highest values of the normalised short-range autocorrelation function occur differs between synthetic speech and natural speech, and the configuration of the bright region also differs between the two. These distinctions can be attributed to the synthetic speech generation process, which uses specific high-probability symbols within the model.

[Images: Normalised short-range autocorrelation (pitch) patterns of synthetic speech and natural speech]
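
As a rough illustration of this category of features (not our production pipeline), the sketch below uses the open-source librosa library to extract a pitch contour and summarise its deviation; the file names are placeholders.

```python
# A minimal sketch: compare pitch statistics of two clips with librosa.
import librosa
import numpy as np

def pitch_stats(path):
    # Load audio at 16 kHz mono
    y, sr = librosa.load(path, sr=16000)
    # Estimate the fundamental frequency (F0) contour with pYIN
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    f0 = f0[voiced_flag]  # keep voiced frames only
    # Summary statistics of the pitch contour; synthetic speech often shows
    # unusually low pitch deviation or an overly regular cadence
    return {"mean_f0": np.nanmean(f0), "std_f0": np.nanstd(f0)}

print(pitch_stats("suspect_clip.wav"))     # placeholder file names
print(pitch_stats("reference_clip.wav"))
```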

B. Generic Spectral Feature-Based Detection: Another method in the arsenal of detection strategies involves the utilisation of generic spectral features. While the specifics of this technique may vary, it typically uses ML methods to extract audio spectral features and statistically compare them. Spectral features are characteristics derived from the frequency domain representation of audio signals, and they can provide valuable information about the underlying sound. One of the tools that can be used is openSMILE (open-source Speech and Music Interpretation by Large-space Extraction), a popular toolkit for extracting audio features. This approach relies on a data-driven methodology to discern deepfakes from genuine audio through the analysis of spectral characteristics.
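
For illustration, a minimal sketch of this data-driven approach, assuming the open-source opensmile Python package and scikit-learn, might look like the following; the file names, labels and classifier choice are placeholders rather than our actual setup.

```python
# A minimal sketch: extract openSMILE functionals and fit a simple classifier.
import opensmile
from sklearn.linear_model import LogisticRegression

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,      # spectral/prosodic functionals
    feature_level=opensmile.FeatureLevel.Functionals,
)

train_files = ["real_01.wav", "fake_01.wav"]          # placeholder file list
train_labels = [0, 1]                                 # 0 = genuine, 1 = synthetic

# One feature vector per clip, then a simple statistical classifier on top
X_train = [smile.process_file(f).values.flatten() for f in train_files]
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

# Score an unseen clip
X_test = smile.process_file("suspect.wav").values.flatten()
print(clf.predict_proba([X_test]))
```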

C. End-to-end Deep Neural Network-Based Solution: The third category uses the power of end-to-end deep neural networks to detect audio deepfakes. We use this method for our work on audio deepfake detection and will expand more on this.

A typical setup of an end-to-end deep neural network-based solution is shown below.

[Image: Typical setup of an end-to-end deep neural network-based audio deepfake detection solution]

Audio deepfake detection using deep neural network models requires a large dataset of labelled data to learn and identify artefacts that may indicate a synthetic speech audio clip. During inference, the audio deepfake detector pre-processes the input, passes it through the model and classifier, and determines whether the speech audio clip is genuine or synthetic.
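
The sketch below illustrates that inference flow in PyTorch. The DeepfakeDetector class, its toy encoder and the file path are hypothetical stand-ins, not a specific published model.

```python
# A minimal sketch of the pre-process -> model -> classifier inference flow.
import torch
import torch.nn as nn
import torchaudio

class DeepfakeDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(            # toy encoder over raw waveform
            nn.Conv1d(1, 16, kernel_size=400, stride=160), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(16, 2)       # 2 classes: genuine vs synthetic

    def forward(self, wav):                      # wav: (batch, 1, samples)
        h = self.encoder(wav).squeeze(-1)
        return self.classifier(h)

# Pre-process: load, resample to 16 kHz, keep one channel and a fixed length
wav, sr = torchaudio.load("suspect.wav")         # placeholder path
wav = torchaudio.functional.resample(wav, sr, 16000)[:1, :64000]

model = DeepfakeDetector().eval()                # weights would come from training
with torch.no_grad():
    probs = torch.softmax(model(wav.unsqueeze(0)), dim=-1)
print({"genuine": probs[0, 0].item(), "synthetic": probs[0, 1].item()})
```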

These end-to-end deep neural network-based models have gained prominence in the audio deepfake detection landscape for their ability to automatically learn intricate patterns from raw audio data. The models can be divided into two subcategories:

A. Computer Vision-Based Solution: Innovations in Computer Vision (CV) have been adapted to audio deepfake detection by translating audio spectrograms into images, in which the colour, x-axis and y-axis represent amplitude, time and frequency respectively. These “audio images” can be processed using Convolutional Neural Networks (CNNs) and other vision-inspired techniques. This approach considers not only the audio content but also its visual representation, enhancing detection accuracy. Three spectrogram representations are frequently used, each with its own advantage: the Constant-Q Transform (CQT), Mel and Log spectrograms. The CQT spectrogram can capture fine-grained spectral features that are important in differentiating between real and deepfake audio, while the Mel spectrogram captures the pitch and timbre features that are important for human speech recognition. Log spectrograms are useful for capturing the dynamic range of audio signals. In general, the CQT spectrogram is more accurate when used for audio deepfake detection⁵.
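
A minimal sketch of computing these three spectrogram views with librosa is shown below; the file name is a placeholder.

```python
# A minimal sketch: CQT, Mel and log-STFT spectrograms as 2-D "audio images".
import librosa
import numpy as np

y, sr = librosa.load("suspect.wav", sr=16000)    # placeholder path

# Constant-Q transform: log-spaced frequency bins, fine spectral detail
cqt_db = librosa.amplitude_to_db(np.abs(librosa.cqt(y, sr=sr)), ref=np.max)

# Mel spectrogram: perceptually motivated frequency scale (pitch/timbre)
mel_db = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80), ref=np.max
)

# Log (dB-scaled) STFT spectrogram: wide dynamic range
log_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)

print(cqt_db.shape, mel_db.shape, log_db.shape)
# Any of these 2-D arrays can be treated as an image and fed to a CNN.
```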

B. Raw Audio Input-Based Solution: Raw audio input-based models, such as the renowned “wav2vec,” directly operate on the waveform data. This approach bypasses the need for handcrafted features, allowing the model to autonomously extract meaningful information from the audio. Raw audio models can capture fine-grained details in the audio signal, making them potentially more robust to variations in the acoustic environment, accents, and speaking styles.
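
As a sketch, the pretrained wav2vec 2.0 bundle that ships with torchaudio can be used to extract such representations; the file name is a placeholder and this is not our exact detector.

```python
# A minimal sketch: wav2vec 2.0 features straight from the raw waveform.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

wav, sr = torchaudio.load("suspect.wav")                         # placeholder path
wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)

with torch.no_grad():
    # One embedding sequence per transformer layer; a downstream classifier
    # (e.g., a small MLP) would be trained on top of these features.
    features, _ = model.extract_features(wav)
print(len(features), features[-1].shape)   # e.g., (batch, frames, 768) for the last layer
```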

Generalising is crucial for AI models to perform well on a wide range of tasks and adapt to new data and situations. It ensures that AI systems are not just memorising training data but learning meaningful patterns/artefacts, making them more reliable, robust, and applicable to real-world scenarios. However, recent research and experiments show that audio deepfake detectors are difficult to generalise: they exhibit a sharp drop in performance, especially during cross-dataset testing. Why is this the case?

The primary issue lies in the diversity and complexity of audio deepfakes. Audio deepfakes span a wide range of acoustic scenarios, accents, and languages, making it challenging to develop a one-size-fits-all detection model. Different audio deepfake generators using different model architectures may produce contrasting artefacts that are picked up by deep neural network-based solutions. As a result, when tested against data it hasn’t encountered during training, a deepfake detector’s performance suffers. Fraunhofer AISEC conducted an experiment with various state-of-the-art deep neural network models at the time, such as LCNN, LCNN-LSTM, MesoNet, ResNet18, RawNet2 and RawGAT-ST. They discovered that most, if not all, of the current models suffer from poor Out-of-Distribution (OoD) detection under cross-dataset benchmarking.

Cross-dataset benchmarking is the process of evaluating model performance across different datasets to understand its effectiveness and generalisability. In their experiment, they used the best feature type determined by experimentation, trained the models with ASVspoof19 training data, and evaluated them against the ASVspoof19 evaluation data and Fraunhofer’s own In-the-Wild dataset. The In-the-Wild dataset⁶ consists of 37.9 hours of audio clips that are either fake (17.2 hrs) or real (20.7 hrs). The table below shows the comparison of different audio deepfake models and the results of the cross-dataset validation.

[Table: Cross-dataset validation results (EER %) of different audio deepfake detection models]

The Equal Error Rate (EER) is used to evaluate the performance of a binary classifier (e.g., in biometric systems, presentation attack detection, etc.) and is defined as the operating point where the False Acceptance Rate (FAR) and False Rejection Rate (FRR) are equal. Generally, a higher EER % represents lower performance because it corresponds to higher false positives and false negatives. Based on the results of the cross-dataset experiment presented in the table above, it is evident that the majority, if not all, of the models included in the study encountered difficulties identifying OoD data that is not part of the training data distribution. This is indicated by a substantial increase in EER % when validated against the In-the-Wild dataset, which consists of deepfakes generated by newer model architectures.
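
A minimal sketch of how EER can be computed from detector scores with scikit-learn is shown below; the labels and scores are placeholder arrays.

```python
# A minimal sketch: EER from detection scores via the ROC curve.
import numpy as np
from sklearn.metrics import roc_curve

labels = np.array([0, 0, 1, 1, 1, 0])                 # 1 = spoof, 0 = bona fide
scores = np.array([0.1, 0.4, 0.8, 0.7, 0.3, 0.2])     # detector "spoof" scores

fpr, tpr, thresholds = roc_curve(labels, scores)
fnr = 1 - tpr
# EER is the point where the false positive and false negative rates cross
idx = np.nanargmin(np.abs(fnr - fpr))
eer = (fpr[idx] + fnr[idx]) / 2
print(f"EER = {eer:.2%} at threshold {thresholds[idx]:.2f}")
```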

One way to tackle the OoD detection issue in audio deepfakes is to actively monitor the landscape for emerging generators and generate new training data for model retraining. The S&S team verified this by mixing the ASVspoof19 and In-the-Wild datasets to train both AASIST and CNN-LSTM models, demonstrating the importance of staying up to date with the latest developments in the generator domain.

As presented in the second table (below), the team conducted training and evaluation on various dataset combinations and compared the results. The model performed well when evaluated against the distribution of data it was originally trained on but performed poorly when presented with data outside its training distribution. However, when the model was trained with a combination of datasets, there was a noticeable improvement in its performance, demonstrating competence on both data distributions, namely the ASVspoof19 and In-the-Wild datasets. To further support this observation, the team tested the new models on the ASVspoof21 dataset, which is an extension of the ASVspoof19 dataset. While the results still indicated that the models had some out-of-distribution (OoD) issues, it is noteworthy that the AASIST model generalised better than the CNN-LSTM model when assessed on the ASVspoof21 dataset.

[Table: Training and evaluation results (EER %) for AASIST and CNN-LSTM across dataset combinations]
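
A minimal sketch of this dataset-mixing step in PyTorch is shown below; AudioFolderDataset is a hypothetical dataset class and the paths are placeholders.

```python
# A minimal sketch: exposing the model to both data distributions during training.
from torch.utils.data import ConcatDataset, DataLoader, Dataset

class AudioFolderDataset(Dataset):
    """Hypothetical dataset that yields (waveform, label) pairs from a folder."""
    def __init__(self, root):
        self.items = []            # would be populated by scanning `root`
    def __len__(self):
        return len(self.items)
    def __getitem__(self, i):
        return self.items[i]

asvspoof19 = AudioFolderDataset("ASVspoof2019/train")   # placeholder paths
in_the_wild = AudioFolderDataset("InTheWild/train")

# Concatenating the corpora lets a single training run cover both distributions
mixed = ConcatDataset([asvspoof19, in_the_wild])
loader = DataLoader(mixed, batch_size=32, shuffle=True)
```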

The battle against audio deepfakes is a never-ending cat and mouse chase. As defenders improve their detection methods and generate data to counter new generators, bad actors respond by refining their techniques. This constant evolution of tactics and counter-tactics underscores the importance of staying up to date with the latest developments in the field. We can remain up to date by:

  1. Continuous retraining of models with different combinations of benchmark datasets.
  2. Evaluating the latest models from the research community against existing benchmark datasets.
  3. Including new spoofing methods in our evaluations.

If you are interested in this topic, Fraunhofer AISEC has also conducted experiments on other parameters, such as the evaluation audio input length, fixed vs variable evaluation audio input length, and the feature extraction technique (i.e., cqtspec, logspec, melspec), which you can read about in their paper “Does Audio Deepfake Detection Generalize?”⁷.

We have chosen to focus on models that are more capable of generalising, based on our cross-dataset validation experiment results shown above. Raw audio input-based deepfake detectors were selected as they generally perform better in cross-dataset benchmarking, possibly because they are able to learn more generalisable patterns in the audio data. The model we chose is Audio Anti-Spoofing using Integrated Spectro-Temporal graph attention networks (AASIST), a lightweight model that uses fewer parameters than its peers (e.g., RawGAT-ST uses 437k parameters while AASIST uses 297k parameters). This makes it more efficient and easier to deploy. Moreover, it performs slightly better than its raw-input peers in terms of EER % for both the ASVspoof19 and In-the-Wild evaluations. See the overall architecture of AASIST below.

[Image: Overall architecture of AASIST]

AASIST uses graph attention networks (GATs) to learn representations of the relationships between different time frames and spectral bands in an audio signal. These representations are then used to classify the audio input as either real or fake. The AASIST model has been shown to outperform other state-of-the-art models on audio deepfake detection tasks. This is likely because GATs can learn complex relationships between the different time frames and spectral bands in an audio signal, which is important for detecting audio deepfakes.
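
To give a feel for the mechanism, here is a deliberately simplified sketch of a single graph attention layer in PyTorch; it illustrates the idea AASIST builds on and is not the AASIST code itself.

```python
# A simplified sketch of graph attention over "nodes" such as time frames or
# spectral bands: each node attends to all others and aggregates their features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGraphAttention(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x):
        # x: (batch, nodes, in_dim)
        h = self.proj(x)                                  # (B, N, D)
        n = h.size(1)
        # Pairwise attention logits e_ij from concatenated node features
        hi = h.unsqueeze(2).expand(-1, -1, n, -1)         # (B, N, N, D)
        hj = h.unsqueeze(1).expand(-1, n, -1, -1)         # (B, N, N, D)
        e = F.leaky_relu(self.attn(torch.cat([hi, hj], dim=-1))).squeeze(-1)
        alpha = torch.softmax(e, dim=-1)                  # attention over neighbours
        return alpha @ h                                  # weighted sum of node features

# Toy usage: 10 nodes with 64-dim features
out = SimpleGraphAttention(64, 32)(torch.randn(2, 10, 64))
print(out.shape)   # torch.Size([2, 10, 32])
```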

Instead of focusing solely on developing a deepfake detector, we adopted a layered defence approach by integrating speaker verification models with the detector. Speaker verification models are designed to authenticate the identity of a speaker by comparing their voice against a known reference voiceprint. These models can be used to verify the identity of individuals and prevent scenarios such as real humans trying to imitate the target speaker.

A study⁸ has shown that researchers adapting several state-of-the-art speaker verification models were able to achieve good performance, high generalisation ability (i.e., the ability to detect OoD data) and high robustness to audio impairment.
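
As an illustration of this building block, the sketch below uses SpeechBrain’s pretrained ECAPA-TDNN speaker verification model (not necessarily the models used in the study, and the import path may differ across SpeechBrain versions); the file names are placeholders.

```python
# A minimal sketch: verify a suspect clip against a reference recording
# of the claimed speaker using a pretrained speaker verification model.
from speechbrain.pretrained import SpeakerRecognition

verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

score, prediction = verifier.verify_files("reference_speaker.wav", "suspect.wav")
print(f"similarity score = {score.item():.3f}, same speaker = {bool(prediction)}")
```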

Can audio deepfake detectors catch every fake? The short answer is no. As demonstrated earlier, audio deepfake detectors suffer from OoD detection issues, meaning they are unable to detect deepfakes created by a generator unknown to them. Additionally, an audio deepfake detector is susceptible to adversarial attacks, in which small, sometimes imperceptible perturbations (e.g., Gaussian noise, room impulse responses (RIR), etc.) are added to the data to fool the model into treating a fake clip as real.

In a study⁹ by Piotr Kawa, a PhD student at the Wroclaw University of Science and Technology, various deepfake detectors were evaluated against a variety of adversarial attack methods (e.g., Fast Gradient Sign Method, Projected Gradient Descent, Fast Adaptive Boundary, etc.). The study found that the detectors were vulnerable to all the adversarial attack methods tested and that their performance could be severely degraded: the EER can rocket as high as 99% under white-box attacks.
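
For intuition, here is a minimal sketch of the Fast Gradient Sign Method against a generic differentiable detector; the model, waveform and epsilon are placeholders, and this is not the exact setup from the study.

```python
# A minimal sketch of FGSM: perturb the waveform along the sign of the loss gradient.
import torch
import torch.nn.functional as F

def fgsm_attack(model, wav, label, epsilon=0.001):
    """Return an adversarially perturbed waveform for a given detector `model`."""
    wav = wav.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(wav), label)
    loss.backward()
    # Step in the direction that increases the loss, keeping a valid audio range
    adv = wav + epsilon * wav.grad.sign()
    return adv.clamp(-1.0, 1.0).detach()
```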

What can we do to bolster the robustness of audio deepfake detection models against adversarial attacks? Researchers have been exploring various strategies, and one promising approach is training these models on a dataset of audio deepfakes that have been augmented with different types of noise. By introducing noisy data during training, models may become more resilient to adversarial attempts¹⁰.
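
A minimal sketch of such additive-noise augmentation, assuming a PyTorch training loop, could look like this; the SNR value and batch are placeholders.

```python
# A minimal sketch: add white Gaussian noise to training batches at a target SNR.
import torch

def add_gaussian_noise(wav, snr_db=20.0):
    """Add white Gaussian noise to a waveform batch at a given SNR (in dB)."""
    signal_power = wav.pow(2).mean(dim=-1, keepdim=True)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = torch.randn_like(wav) * noise_power.sqrt()
    return wav + noise

# In a training loop, clean and noisy versions of each batch would both be
# passed to the detector so it learns features that survive the perturbation.
batch = torch.randn(8, 64000)            # placeholder batch of short clips
noisy_batch = add_gaussian_noise(batch, snr_db=15.0)
```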

It is important to recognise that there is no foolproof way to defend against adversarial attacks. Adversarial attackers are constantly developing new attack methods, and it is often challenging for deepfake detectors to keep up. We need to be aware of the limitations of deepfake detectors and use them in conjunction with other measures.

What More Can Be Done to Guard Against Audio Deepfakes?

A multi-pronged approach can be used to enhance protection against deepfake threats. This means using a variety of strategies and methods rather than just focusing on detection alone. Here are some potential methods that can be considered:

1. Audio watermarking

Audio watermarking is a technique that embeds a hidden signature in an audio signal. By embedding a signature in the official audio/speech, the watermark can identify and authenticate the source of the original audio/speech and verify that no manipulation has been performed on the audio clip. One such technique is Spread Spectrum Audio Watermarking (SSW), whose key feature is spreading the watermark signal across a broad range of frequencies, making it difficult to detect or remove without affecting the perceived audio quality. Below is an example of an imperceptible watermark created using the spread spectrum watermarking technique.

[Image: Example of an imperceptible spread spectrum audio watermark]
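
To illustrate the core idea, here is a highly simplified time-domain sketch of spread spectrum embedding and correlation-based detection; the key, strength alpha and signal are placeholders, and practical systems embed in a transform domain with psychoacoustic shaping.

```python
# A minimal sketch: embed a 1-bit watermark with a keyed pseudo-noise sequence
# and detect it by correlating against the same sequence.
import numpy as np

def embed(audio, bit, key, alpha=0.002):
    rng = np.random.default_rng(key)
    pn = rng.choice([-1.0, 1.0], size=audio.shape)   # pseudo-noise spreading sequence
    return audio + alpha * (1 if bit else -1) * pn

def detect(audio, key):
    rng = np.random.default_rng(key)
    pn = rng.choice([-1.0, 1.0], size=audio.shape)
    # Correlation with the same PN sequence: clearly positive => bit 1 present
    return float(np.dot(audio, pn) / len(audio))

audio = np.random.randn(160000) * 0.1                # placeholder 10-second signal
marked = embed(audio, bit=1, key=42)
print(detect(marked, key=42), detect(audio, key=42)) # watermarked vs unmarked
```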

2. File structure and metadata analysis

File structure and metadata analysis is a technique that examines the structure and metadata of an audio file to identify signs of manipulation. For example, deepfakes created using specific generators may have a distinctive file structure that can be matched against known file format signatures, or carry unusual metadata such as inconsistent timestamps (as shown below). By analysing the file structure and metadata of an audio file, it is possible to identify signs of manipulation, deduce the generation history and determine whether the audio file is authentic.

[Image: Example of audio file metadata showing inconsistent timestamps]
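
A minimal sketch of inspecting container metadata with the open-source mutagen library is shown below; the file path is a placeholder, and real workflows would also compare raw header bytes against known format signatures.

```python
# A minimal sketch: dump stream info and embedded tags from an audio container.
from mutagen import File

audio = File("suspect.flac")       # placeholder path; container type is auto-detected
print(audio.pprint())              # stream info plus all embedded tags/metadata
# Red flags include encoder tags from known TTS/VC tools, fields a recording
# device would normally write but that are missing, or timestamps that
# contradict the claimed origin of the clip.
```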

3. Attacker signature detection

Attacker signatures are unique patterns or characteristics that can be used to identify the work of a particular attacker (i.e., a deepfake generator). Research on attacker attribution of audio deepfakes¹¹ shows that it is possible to differentiate the type of deepfake generator used to create the fakes in the ASVspoof19 dataset. This is done by extracting neural embeddings with the model described in the research paper¹² by Ye Jia et al. The researchers then tried to classify deepfake data that is OoD (ASVspoof21 in this scenario) and obtained distinct clusters, indicating that the model can generalise and differentiate attack signatures from unknown sources. See the in-domain evaluation of the neural attack signatures of the ASVspoof19 dataset below.

[Image: In-domain evaluation of neural attack signatures on the ASVspoof19 dataset]
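
For illustration, the sketch below shows how such embeddings could be projected and clustered with scikit-learn; the embeddings and generator labels are random placeholders standing in for the neural embeddings described above.

```python
# A minimal sketch: visualise and cluster attack-signature embeddings.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

embeddings = np.random.randn(300, 256)        # placeholder: one 256-d vector per clip
generator_ids = np.random.randint(0, 6, 300)  # placeholder: true generator labels

# Project to 2-D for inspection; well-separated clusters suggest the embedding
# space captures generator-specific artefacts
coords = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)

# Unsupervised clustering can then group clips by suspected generator
clusters = KMeans(n_clusters=6, n_init=10).fit_predict(embeddings)
print(coords.shape, np.bincount(clusters))
```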

4. Single-speaker model

A single-speaker model is a deep learning model that is trained on a dataset of audio/speech from a single speaker. This type of model can be used to detect deepfakes by identifying inconsistencies in the audio/speech of the deepfake. For example, a single-speaker model can be used to identify changes in the speaker’s voice, such as changes in pitch, intonation, and pronunciation. These changes can be indicative of a deepfake.

[Image: Single-speaker model for audio deepfake detection]

Even without the help of technology, there are certain tell-tale signs of audio deepfakes at work, such as robotic-sounding speech, pitch-perfect pronunciation, and the absence of filler words. However, advanced audio deepfake generators may mask these indicators. It is crucial to use alternative channels or methods to confirm the authenticity of people & content.

The domain of audio deepfake detection is a rapidly evolving field, with continuous advancements in sophistication & technologies. As we try to build up defence against adversarial attacks, improve model generalisability, and adapt to the fast-evolving landscape of deepfake generators, it’s essential to acknowledge that audio deepfake detection should not be the only focus. Instead, a multi-faceted approach should be considered, combining various strategies & methods to identify and guard against deepfake threats. The techniques that we have covered above are just some of the ways to protect authenticity, prevent and detect audio deepfakes.

If you’ve been following our articles, you will notice that we thrive on exploring different possibilities and approaches to tackle a problem. Our work involves rigorous research and experimentation to validate our hypotheses and develop viable solutions to solve real-world problems. Our expertise lies in the integration and processing of data from different sensory devices, including visual, acoustic, lidar, Wi-Fi, Sonar, etc. Our primary focus is advanced computer vision techniques and machine learning algorithms to extract meaningful insights and valuable information from a diverse range of data sources. Our latest posts include ‘Don’t Get Lost in Translation: Recognising Key Words in Speech without using Natural Language Processing’ by Alvin Wong and how our intern, Wen Hui, spent her summer holiday on ‘Building AI Solutions for Real-World Challenges as an Intern at HTX’.

If you want to stay updated on our projects in different AI and sensors engineering fields, consider subscribing to our medium channel. Feel free to reach out to us at TAN_JIAN_HONG@htx.gov.sg or Alvin_WONG@htx.gov.sg (co-author Alvin Wong) if you want to learn more or discuss ideas related to Audio Deepfake Detection and Speaker Recognition.

  1. How AI is restoring voices damaged by ALS using voice banking. The Washington Post. 20 Apr 2023. https://www.washingtonpost.com/wellness/interactive/2023/voice-banking-artificial-intelligence/
  2. James Earl Jones is hanging up his cape as Darth Vader. Cable News Network (CNN). 26 Sep 2022. https://edition.cnn.com/2022/09/26/entertainment/james-earl-jones-darth-vader-retiring-cec/
  3. Deepfake phone scam report. CBC News. https://www.cbc.ca/news/canada/newfoundland-labrador/deepfake-phone-scame-1.6793296
  4. Deepfake video of Zelenskyy could be ‘tip of the iceberg’ in info war, experts warn. National Public Radio. 16 Mar 2022. https://www.npr.org/2022/03/16/1087062648/deepfake-video-zelenskyy-experts-war-manipulation-ukraine-russia
  5. Does Audio Deepfake Detection Generalize? Fraunhofer AISEC. September 2022 https://www.isca-speech.org/archive/pdfs/interspeech_2022/muller22_interspeech.pdf
  6. In-the-Wild Audio Deepfake Data. Fraunhofer AISEC. https://deepfake-demo.aisec.fraunhofer.de/in_the_wild
  7. Does Audio Deepfake Detection Generalize? Fraunhofer AISEC. September 2022 https://www.isca-speech.org/archive/pdfs/interspeech_2022/muller22_interspeech.pdf
  8. Deepfake Audio Detection by Speaker Verification. Alessandro Pianese, Davide Cozzolino, Giovanni Poggi and Luisa Verdoliva. September 2022. https://browse.arxiv.org/pdf/2209.14098.pdf
  9. Defense Against Adversarial Attacks on Audio DeepFake Detection. Piotr Kawa, Marcin Plata, Piotr Syga. June 2023. https://arxiv.org/abs/2212.14597
  10. Defense Against Adversarial Attacks on Audio DeepFake Detection. Piotr Kawa, Marcin Plata, Piotr Syga. June 2023. https://paperswithcode.com/paper/defense-against-adversarial-attacks-on-audio
  11. Attacker Attribution of Audio Deepfakes. Nicolas M. Müller, Franziska Dieckmann, Jennifer Williams. March 2022. https://arxiv.org/abs/2203.15563
  12. Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis. Ye Jia et al. June 2018. https://arxiv.org/abs/1806.04558