AntiDeepFake: AI for Deep Fake Speech Recognition (2024)

Enkhtogtokh Togootogtokh
Technidoo Solutions Lab
Technidoo Solutions Germany and Mongolian University of Science and Technology
Bavaria, Germany
enkhtogtokh.java@gmail.com, togootogtokh@technidoo.com

Christian Klasen
Technidoo Solutions Lab
Technidoo Solutions Germany
Bavaria, Germany
klasen@technidoo.com

Abstract

In this research study, we propose a modern artificial intelligence (AI) approach to recognizing deepfake voice, also known as generative-AI-cloned synthetic voice. Our proposed AI technology, called AntiDeepFake, covers all main pipelines, from data processing to evaluation. We provide experimental results and scores for all our proposed methods. The main source code for our approach is available in the https://github.com/enkhtogtokh/antideepfake repository.

Index Terms:

Anti DeepFake, AI DeepFake Detection, Voice Clone Recognition, Synthetic Voice Recognition, DeepFake Recognition, Spoof Recognition, AI for Anti Spoof

I Introduction

Deepfake technology refers to the use of artificial intelligence and machine learning algorithms to create, manipulate, or enhance digital images, videos, and audio recordings in a way that makes them appear real and authentic. This technology has the potential to be used for a variety of purposes, including entertainment, marketing, and even political propaganda. However, it also has the potential to be used for malicious purposes, such as spreading misinformation, blackmail, and identity theft.

One of the main concerns about deepfake technology is its ability to create highly convincing fake video and audio that can be used to manipulate public opinion and spread false information. For example, a deepfake video or audio clip could be used to make it appear as though a politician said something they never actually said, or as though a celebrity endorsed a product they never actually used. This could have serious consequences for the reputation and credibility of the individuals involved, as well as for the wider public.

Another concern about deepfake technology is its potential use for identity theft. With the ability to create highly convincing fake audio and video, it is possible for someone to impersonate another person and use their identity to commit crimes or access sensitive information. This could have serious legal and financial consequences for the individuals involved, as well as for wider society.

In addition to these concerns, deepfake technology also raises important ethical questions about the use of artificial intelligence and the potential for technology to be used to manipulate and deceive people. As this technology continues to develop, it is important to consider the potential risks and benefits, and to develop appropriate safeguards and regulations to ensure that it is used in a responsible and ethical manner.

With current generative AI technology, deepfake technology has become increasingly prevalent, with the ability to create highly convincing audio and video clones of individuals. This has raised concerns about the potential for misuse and the need for effective methods to detect and prevent the spread of deepfakes. In this paper, we propose a modern artificial intelligence (AI) approach to recognize deepfake voice, which we call AntiDeepFake.

The AntiDeepFake system consists of five main pipelines, from the data pipeline to performance evaluation. The first pipeline involves collecting a large dataset of real and deepfake audio samples, which are then preprocessed through feature extraction and engineering and split into training and test sets. The second pipeline involves training gradient-boosted and tabular encoder-decoder architectures to classify the audio samples as either real or deepfake.

We present experimental results and scores for the AntiDeepFake system, which demonstrate its effectiveness in recognizing deepfake voice. Our results show that the system achieves high accuracy and robustness, even when faced with challenging edge cases and synthetic deepfake samples. We also provide the main source code for the AntiDeepFake system, which can be used by other researchers and developers to build upon our work and further improve the performance of deepfake voice recognition systems.

Deepfake technology poses a significant threat to the integrity of audio and video content, and there is an urgent need for effective methods to detect and prevent the spread of deepfakes. Our proposed AI approach, AntiDeepFake, provides a robust and accurate solution for recognizing deepfake voice, and has the potential to significantly improve the security and reliability of digital media.

Concretely, the key contributions of the proposed work are:

  • An industry-level AI technology for anti-deepfake speech recognition

  • A tabular AI framework for tabular classification, regression, and other tasks

Systematic experiments conducted on real-world acquired data demonstrate the effectiveness of the proposed approach.

The rest of the paper is organized as follows. The proposed framework is described in Section II. The data pipeline is explained in Section II-A. The state-of-the-art (SOTA) models are explained in Section II-D. The details about the experimental results and evaluations are presented in Section III. Finally, Section IV provides the conclusions and future work.

[Figure 1: Overview of the proposed AntiDeepFake architecture]

II The proposed Anti DeepFake Voice Architecture (AntiDeepFake)

In this section, we discuss the proposed AntiDeepFake approach for deepfake speech recognition, as shown in Figure 1. AntiDeepFake has five main pipelines, spanning data processing and efficient feature extraction through AI evaluation and experimental results, as shown in Figure 2. We discuss them in detail in the following sections.

II-A The data pipeline

Extracting significant audio features is an important part of modern deep learning, and there are many mechanisms for doing so. Here we extract mel-spectrogram-based audio features, which are later used to train models with high accuracy.

II-A1 The data collection

To collect an audio dataset labeled as genuine (real) and spoof (fake), several methods can be employed, including crawling technologies and existing audio resources, with ethical considerations.

  • 1. For real audio datasets, speech corpora such as LJSpeech and public speeches can be utilized, as they are readily available and have no restrictions, particularly for academic research purposes.

  • 2. On the other hand, collecting a fake audio dataset is relatively straightforward due to advancements in modern synthetic voice cloning AI models, such as recent state-of-the-art generative voice cloning models built on generative pretrained transformer (GPT) architectures; see the sketch after this list.
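As a hedged illustration of item 2, one open-source option is a voice-cloning text-to-speech toolkit such as Coqui TTS; the model name, file paths, and sentence below are illustrative assumptions, not taken from our pipeline:

```python
# Illustrative sketch: generating spoof (fake) samples with an open-source
# voice-cloning TTS toolkit (Coqui TTS shown as one example). The model name
# and file paths are assumptions for demonstration only.
from TTS.api import TTS

# Load a multilingual voice-cloning model (downloads weights on first use).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone the voice in reference_speaker.wav onto a synthetic sentence.
tts.tts_to_file(
    text="This sentence was never spoken by the reference speaker.",
    speaker_wav="real/reference_speaker.wav",
    language="en",
    file_path="fake/sample_0001.wav",
)
```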

II-A2 The data transformation

Furthermore, the audio feature extraction process results in tabular data, which we discuss in detail in the following sections.

II-B The Feature Extraction

In order to analyze the unique characteristics of real human voice, it is necessary to extract all relevant features from an audio dataset. Significant audio feature extraction is an important part of modern AI and deep learning [1]. In this study, we extracted the most significant features from the dataset for the purpose of further feature engineering. The extracted features, presented in Table I, are utilized in subsequent analyses.

[Figure 2: The AntiDeepFake data pipeline]

TABLE I: Extracted audio features
Feature Name          Variations
Pitch                 STD, Mean
Shimmer               STD, Mean
Jitter                STD, Mean
Formants              F1, F2, F3, F4
Chroma STFT           STD, Mean
RMS                   STD, Mean
Spectral Centroids    STD, Mean
Spectral Bandwidths   STD, Mean
Rolloffs              STD, Mean
Zero Crossing Rates   STD, Mean
MFCC                  STD, Mean
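A minimal sketch of computing the librosa-computable rows of Table I is shown below. Pitch, jitter, shimmer, and formants are typically obtained with a phonetics tool such as Praat (e.g., via parselmouth) and are omitted here; the function and variable names are our own illustrative choices:

```python
# Sketch: mean/std summary statistics for the spectral features of Table I.
import numpy as np
import librosa

def extract_features(path: str) -> np.ndarray:
    """Return mean/std summary statistics of several spectral features."""
    y, sr = librosa.load(path, sr=None)
    feature_maps = [
        librosa.feature.chroma_stft(y=y, sr=sr),
        librosa.feature.rms(y=y),
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.spectral_bandwidth(y=y, sr=sr),
        librosa.feature.spectral_rolloff(y=y, sr=sr),
        librosa.feature.zero_crossing_rate(y),
        librosa.feature.mfcc(y=y, sr=sr),
    ]
    stats = []
    for fm in feature_maps:
        stats.extend([float(np.mean(fm)), float(np.std(fm))])
    return np.array(stats)
```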

II-C The Feature Engineering (Selection)

In practice, it is not necessary to consider all extracted features; rather, it is more appropriate to focus on the most significant ones, a process referred to as feature selection or feature engineering. The importance of feature engineering lies in the fact that the quality and relevance of the features used in a model can greatly affect its ability to accurately predict outcomes. In addition, feature engineering can help to address issues such as missing data and outliers, which are common in real-world datasets. By handling these issues, the model becomes more robust and better able to handle variations in the data.

Overall, feature engineering plays a critical role in machine learning by helping to ensure that the model is able to accurately capture the underlying patterns and relationships in the data, and by improving the efficiency and accuracy of the model.

In general, there are three primary categories of feature engineering methodology, along with ensembles that combine them:

  • Filter. For example, Pearson correlation

  • Wrapper. For example, a custom boosted-tree-based recursive feature elimination (RFE) approach (SOTA)

  • Embedded. For example, Lasso regularization

  • Ensemble. For example, a combination of the above three methodologies

In this study, we applied a novel custom wrapper-category methodology: a gradient-boosted recursive feature elimination (RFE) approach. In a forthcoming research paper, we will elaborate on this custom feature engineering methodology.
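Since the custom method is left to a forthcoming paper, the sketch below shows only a generic wrapper-style, gradient-boosted RFE using scikit-learn and XGBoost as stand-ins; it is not the authors' custom approach, and the data and feature counts are placeholders:

```python
# Generic gradient-boosted RFE sketch (NOT the custom method from this paper).
import numpy as np
from sklearn.feature_selection import RFE
from xgboost import XGBClassifier

# Placeholder feature matrix (200 clips x 22 features) for demonstration.
X_train = np.random.rand(200, 22)
y_train = np.random.randint(0, 2, size=200)

# Wrapper-style selection: recursively drop the least important feature.
selector = RFE(estimator=XGBClassifier(), n_features_to_select=10, step=1)
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
print(selector.support_)  # boolean mask of the selected features
```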

The success of any AI model in addressing a given problem is highly dependent on the quality and quantity of the data pipeline used. Therefore, it is crucial to have a well-proposed data pipeline in place before implementing any SOTA AI models or other advanced techniques.

The data processing algorithm is as follows:

1: data = collect_data()  // Collect data
2: data = clean_data(data)  // Clean data
3: explore_data(data)  // Explore data
4: data = extract_features(data)  // Transform data and extract features
5: data = select_features(data)  // Select features
6: X_train, y_train, X_test, y_test = train_test_split(data)  // Split data
7: return X_train, y_train, X_test, y_test
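A minimal end-to-end sketch of this pipeline is given below, assuming the extract_features() helper sketched in Section II-B and directory names real/ and fake/ chosen purely for illustration:

```python
import glob
import numpy as np
from sklearn.model_selection import train_test_split

# Build the tabular dataset: one feature row per clip, label 0 = real, 1 = fake.
X, y = [], []
for label, pattern in ((0, "real/*.wav"), (1, "fake/*.wav")):
    for path in glob.glob(pattern):
        X.append(extract_features(path))  # helper sketched in Section II-B
        y.append(label)
X, y = np.array(X), np.array(y)

# Stratified split keeps the real/fake ratio in both train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```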

II-D The AI modeling

In the field of tabular AI classification, gradient-boosted models and deep tabular learning architectures that incorporate sequential attention have emerged as the dominant approaches. Representative state-of-the-art models include CatBoost [2], XGBoost [3], and TabNet [4]. The experimental results, discussed in Section III, show their comparison and evaluation.

II-E The AI training

In our research, we train on all state-of-the-art (SOTA) models described in Section II-D, and the main training algorithm employed is presented below:

1: X_train, y_train, X_test, y_test = Process_Data()
2: sota_models = [TabNet(), XGBoost(), CatBoost()]
3: model = sota_models[i]  // i = selected model index
4: model.fit(X_train, y_train)
5: model.save(model_save_path)
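Continuing from the data-pipeline sketch in Section II-C, a concrete training sketch with CatBoost (the best-performing model in Section III) is shown below; the output file name and default hyperparameters are illustrative assumptions:

```python
from catboost import CatBoostClassifier

# Train on the tabular features from the data-pipeline sketch above.
model = CatBoostClassifier(verbose=0)  # default hyperparameters for brevity
model.fit(X_train, y_train)

print("test accuracy:", model.score(X_test, y_test))
model.save_model("antideepfake_catboost.cbm")  # CatBoost's native format
```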

II-F The inference

After the training process, the main inference algorithm employed is presented below:

1: model = load(model_save_path)
2: y_hat = model.predict(inference_data)
3: return y_hat
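Correspondingly, a minimal inference sketch, continuing the CatBoost example above (the file names are illustrative and extract_features() is the Section II-B helper):

```python
from catboost import CatBoostClassifier

# Restore the trained model saved by the training sketch above.
model = CatBoostClassifier()
model.load_model("antideepfake_catboost.cbm")

# Score a single clip: reuse the Section II-B feature extractor.
features = extract_features("clip_to_check.wav").reshape(1, -1)
y_hat = model.predict(features)  # 0 = real, 1 = fake
print("prediction:", y_hat[0])
```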

III Experimental Results and Evaluation

In this section, we present the setup of our research and then evaluate the state-of-the-art (SOTA) models using the selected features. We conduct experiments in systematic scenarios to analyze the performance of the models.

III-A Setup

We train and test on an Ubuntu 18 machine (CPU: Intel(R) Xeon(R) CPU @ 2.20 GHz; RAM: 16 GB; GPU: NVIDIA GeForce GTX 1070, 16 GB).

III-B The dataset

In this research paper, we present a dataset that was collected in accordance with the description provided in Section II-A1. The dataset contains two distinct speech labels, namely real and fake speeches.

III-C The AI Scores

Along with classification accuracy, further metrics are also considered for model comparison. These are precision, which measures the rate at which predicted positives are correct among all positive predictions and allows for false-positive analysis, and recall, which measures how many positive cases are correctly predicted and enables analysis of false-negative predictions. Precision is important since high precision would minimize false accusations of AI-generated speech when the audio is, in fact, natural voice. Higher recall suggests that the model is not falsely classifying AI-generated speech as human speech. These results are then combined to compute the F1 score.
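Concretely, with TP, FP, and FN denoting the true positives, false positives, and false negatives for a given class, these metrics are computed as:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 * Precision * Recall / (Precision + Recall)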

The training and testing accuracy of the SOTA models is presented in Table II. Among the models evaluated, the CatBoost model achieved the highest score.

TABLE II: Training and testing accuracy of the SOTA models
AI model       Training Accuracy   Testing Accuracy
CatBoost [2]   1.0                 0.937
XGBoost [3]    1.0                 0.925
TabNet [4]     0.86                0.80

Table III below presents the main scores for the CatBoost [2] model.

TABLE III: Classification scores for CatBoost
Name           Precision   Recall   F1-score
Real           0.94        0.93     0.94
Fake           0.93        0.94     0.94
Accuracy                            0.94
Macro avg.     0.94        0.94     0.94
Weighted avg.  0.94        0.94     0.94

Table IV below presents the main scores for the XGBoost [3] model.

TABLE IV: Classification scores for XGBoost
Name           Precision   Recall   F1-score
Real           0.94        0.91     0.92
Fake           0.91        0.94     0.93
Accuracy                            0.93
Macro avg.     0.93        0.93     0.93
Weighted avg.  0.93        0.93     0.93

Table V below presents the main scores for the TabNet [4] model.

TABLE V: Classification scores for TabNet
Name           Precision   Recall   F1-score
Real           0.83        0.78     0.80
Fake           0.79        0.84     0.81
Accuracy                            0.81
Macro avg.     0.81        0.81     0.81
Weighted avg.  0.81        0.81     0.81

IV Conclusion

This research paper focuses on the recognition of generative AI output, specifically deepfakes with AI-generated synthetic human speech. The contributions of this study include the development of a comprehensive modern AI pipeline, an analysis of the significance of audio feature extraction and engineering for real and AI-generated synthetic speech, and an evaluation of SOTA AI models that can distinguish real from synthetic speech. In future work, we will publish the next series of research, applying AntiDeepFake to video analysis tasks.

References

  • [1] Togootogtokh, Enkhtogtokh, and Christian Klasen. "DeepEMO: Deep learning for speech emotion recognition." arXiv preprint arXiv:2109.04081 (2021).
  • [2] Prokhorenkova, Liudmila, et al. "CatBoost: Unbiased boosting with categorical features." Advances in Neural Information Processing Systems 31 (2018).
  • [3] Chen, Tianqi, and Carlos Guestrin. "XGBoost: A scalable tree boosting system." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016.
  • [4] Arik, Sercan Ö., and Tomas Pfister. "TabNet: Attentive interpretable tabular learning." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35, No. 8. 2021.