AI-Driven Multi-Modal Fake Content Detection System Using Audio-Text Fusion and Transformer Network

S. Jeeva; J. S. Trisha; S. Keerthana

doi:10.51903/jtie.v5i1.475

Authors

S. Jeeva, Arunai Engineering College, Tiruvannamalai, Tamil Nadu, India. ORCID: https://orcid.org/0009-0000-5580-0169
J. S. Trisha, Arunai Engineering College, Tiruvannamalai, Tamil Nadu, India. ORCID: https://orcid.org/0009-0008-9291-223X
S. Keerthana, Arunai Engineering College, Tiruvannamalai, Tamil Nadu, India. ORCID: https://orcid.org/0009-0004-9374-7986

DOI:

https://doi.org/10.51903/jtie.v5i1.475

Keywords:

Audio-Text Fusion, Deepfake, Fake Content Detection, Transformer, Machine Learning

Abstract

The rapid proliferation of AI-generated synthetic media poses a substantial threat to digital trust, particularly through audio deepfakes and manipulated text. Existing unimodal detection systems, which analyze either audio or text in isolation, remain insufficient against advanced generative attacks that exploit both modalities simultaneously. This paper proposes an AI-driven multimodal fake content detection framework that jointly leverages acoustic and linguistic signals to enable robust deepfake identification. Mel-Frequency Cepstral Coefficients (MFCCs) and Mel-Spectrograms are extracted from raw audio to capture spectral and temporal vocal patterns. In parallel, BERT-based transformer embeddings encode semantic and contextual information from transcripts generated via Automatic Speech Recognition (ASR). An attention-based fusion layer dynamically weights and integrates both feature streams, and a Random Forest–XGBoost ensemble classifier performs the final authenticity prediction. Experiments conducted on the ASVspoof 2019 benchmark demonstrate a classification accuracy of 95%, with precision of 93%, recall of 94%, and F1-score of 95%, outperforming standalone audio-only and text-only baselines by approximately 4–7%. These findings indicate that cross-modal feature fusion substantially reduces false-detection rates and improves generalization over single-modality approaches. The proposed system offers practical applicability in cybersecurity, voice biometrics, and digital forensics.
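
The attention-based fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the layer's form, so the code below assumes scaled dot-product attention over two modality embeddings that have already been projected to the same dimension. The function name `attention_fuse`, the `query` vector (which would be a learned parameter in a trained model), and the toy input values are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_fuse(
    audio_vec: Sequence[float],
    text_vec: Sequence[float],
    query: Sequence[float],
) -> Tuple[List[float], Tuple[float, float]]:
    """Fuse two same-sized modality embeddings with attention weights.

    Assumption: one scaled dot-product score per modality against a shared
    query vector; the softmax of the two scores gives the modality weights.
    """
    if not (len(audio_vec) == len(text_vec) == len(query)):
        raise ValueError("audio, text, and query vectors must share one dimension")
    scale = math.sqrt(len(query))
    scores = [dot(query, audio_vec) / scale, dot(query, text_vec) / scale]
    w_audio, w_text = softmax(scores)
    # The fused representation is the attention-weighted sum of both streams.
    fused = [w_audio * a + w_text * t for a, t in zip(audio_vec, text_vec)]
    return fused, (w_audio, w_text)


# Toy inputs standing in for an MFCC-derived embedding and a BERT embedding.
audio = [0.2, -0.1, 0.7, 0.4]
text = [0.5, 0.3, -0.2, 0.1]
query = [1.0, 0.0, 1.0, 0.0]  # hypothetical; learned during training in practice

fused, weights = attention_fuse(audio, text, query)
print("modality weights (audio, text):", weights)
print("fused vector:", fused)
```

In a full pipeline of the kind the abstract outlines, the fused vector would then be passed to the downstream ensemble classifier. Because the weights are recomputed per sample, a clip whose audio stream aligns more strongly with the query receives a larger audio weight, which is one way to read the "dynamic weighting" the abstract mentions.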

