Multi-Modal Generative DeepFake Detection via Visual-Language Pretraining with Gate Fusion for Cognitive Computation

Abstract

With the widespread adoption of deep learning, multimodal deepfake content has become increasingly prevalent, posing a substantial risk to individual privacy and asset security. In response, researchers have drawn on generative AI and cognitive computation to detect deepfakes from multimodal data. However, existing efforts do not fully exploit the rich feature information available across modalities, and in particular underuse spatial information across multiple dimensions. In this study, we introduce Visual-Language Pretraining with Gate Fusion (VLP-GF), a framework designed to identify multimodal deceptive content and to localize manipulated regions in both images and textual annotations more accurately. Specifically, we introduce an adaptive fusion module that integrates global context and local details simultaneously, improving the grounding of image bounding boxes within the system. In addition, to make fuller use of the semantic information carried by each modality, we incorporate a gating mechanism that strengthens cross-modal interaction. Through ablation experiments and comprehensive comparisons with state-of-the-art approaches on large benchmark datasets, we empirically demonstrate the superior efficacy of VLP-GF.
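
The abstract does not specify the exact form of the gating mechanism, so the following is only a minimal illustrative sketch of gated cross-modal fusion, assuming PyTorch, pooled per-modality features, and a hypothetical feature dimension of 768; the names GatedFusion, visual, and text are placeholders, not the authors' released code.

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Illustrative gated fusion of visual and textual features.

    A sigmoid gate computed from both modalities weighs each feature
    dimension, blending the two representations adaptively.
    """
    def __init__(self, dim: int = 768):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.Sigmoid(),  # per-dimension weight in [0, 1]
        )

    def forward(self, visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # visual, text: (batch, dim) pooled features from each encoder
        g = self.gate(torch.cat([visual, text], dim=-1))
        return g * visual + (1.0 - g) * text  # gated blend of the two modalities

if __name__ == "__main__":
    fusion = GatedFusion(dim=768)
    v = torch.randn(4, 768)  # dummy visual features
    t = torch.randn(4, 768)  # dummy textual features
    print(fusion(v, t).shape)  # torch.Size([4, 768])

In this sketch the gate lets the model lean on whichever modality is more informative for a given sample, which is one common way such a mechanism strengthens multimodal interaction.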

Publication
Springer
Mingliang Gao
Associate Professor