Open Access archive

Deepfake defense: Combining spatial and temporal cues with CNN–BiLSTM–transformer architecture

Srijana Yadav, Manipal Institute of Technology
S. Sudheer Mangalampalli, Manipal Institute of Technology

Document Type

Article

Publication Title

PLOS One

Abstract

The proliferation of deepfakes is a major threat to the credibility of online media and the stability of public discourse. These hyper-realistic fake videos, nearly indistinguishable from genuine content, can be misused to spread disinformation, commit identity theft, and manipulate political narratives. Most existing deepfake detectors analyze spatial or temporal features in isolation; however, in real-world scenarios involving video compression, occlusions, or frame instability, such approaches are inadequate. Convolutional neural networks (CNNs) effectively capture spatial artifacts but fail to model temporal dynamics, while recurrent neural networks (RNNs) and long short-term memory (LSTM) units handle short-range temporal signals but struggle with long-term dependencies. To address these limitations, we propose a hybrid deep learning architecture that integrates a CNN, bidirectional LSTMs (BiLSTMs), and transformer encoders within a unified framework. The CNN module extracts fine-grained spatial information from each frame, the BiLSTM branch captures local temporal motion, and the transformer encoder models global temporal relationships across video sequences. This dual-path temporal modeling framework leverages the strengths of both sequential learning and attention mechanisms to enable comprehensive spatiotemporal analysis. The model is implemented in TensorFlow using MobileNetV2 as its CNN backbone and evaluated on the FaceForensics++ and DeepFake Detection Challenge (DFDC) datasets. The proposed architecture outperforms baseline models such as XceptionNet, CNN–LSTM, and CNN–Transformer, achieving an F1-score of 90.6% and an AUC of 98.5%. In addition to high detection accuracy, the model remains robust under video quality degradation, making it a practical and scalable solution for detecting deepfakes in critical and sensitive applications.
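
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how an F1-score and an AUC are commonly computed for a binary real/fake classifier. It uses only the Python standard library. The function names, the 0.5 decision threshold, and the toy labels and scores are assumptions made for illustration; they are not taken from the paper, and the abstract does not state the authors' exact evaluation protocol.

```python
def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive (fake) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only (1 = fake, 0 = real).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]      # hypothetical per-video fake probabilities
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 decision threshold

print(round(f1_score(y_true, y_pred), 3))  # 0.667
print(round(roc_auc(y_true, scores), 3))   # 0.889
```

F1 depends on the chosen decision threshold, while AUC summarizes ranking quality across all thresholds, which is why the two figures are usually reported together.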

DOI

10.1371/journal.pone.0334980

Publication Date

11-1-2025

Recommended Citation

Yadav, Srijana and Mangalampalli, S. Sudheer, "Deepfake defense: Combining spatial and temporal cues with CNN–BiLSTM–transformer architecture" (2025). Open Access archive. 12291.
https://impressions.manipal.edu/open-access-archive/12291

