Shahed University

Multimodal video-text matching using a deep bifurcation network and joint embedding of visual and textual features

Masoomeh Nabati | Alireza Behrad

URL : http://research.shahed.ac.ir/WSR/WebPages/Report/PaperView.aspx?PaperID=158660
Date : 2021/07/05
Published in : Expert Systems With Applications

Link : https://www.sciencedirect.com/science/article/abs/pii/S0957417421009489
Keywords : Video-text matching, Video-caption retrieval, Bifurcation network, Deep neural network

Abstract :
Video-text matching is of high importance in the fields of machine vision and artificial intelligence. The main challenge in video-text matching is the projection of video and textual features into a common semantic space, known as video-text joint embedding. The proper functioning of video-text joint embedding depends on two factors: the effectiveness of the extracted information for video-text matching and the suitability of the network structure for projecting the extracted features into a common space. Generally, existing approaches do not leverage all the audio and visual information of a video for video-text matching. This study proposes a new approach for video-text matching that extracts a comprehensive set of visual and textual features and projects them into a common semantic space using an effective structure. We use a new deep network with two branches, one textual and one video, that extracts several informative high-level visual and textual features and maps them into a shared space. In the video branch, we extract several descriptors comprising appearance-based, concept-based, and action-based features, and employ an audio encoder to extract sound features from the video clip. We also utilize several feature extraction approaches in the textual branch for the effective transformation of an input sentence into the common semantic space. Furthermore, an image description database is used to pre-train and initialize the network weights. We evaluated the proposed architecture on two popular video description datasets and compared the results with those of several state-of-the-art approaches. The comparison results show the effectiveness of the proposed algorithm.
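The abstract does not give implementation details, so the following is only a minimal sketch of the general two-branch joint-embedding pattern it describes: video descriptors (appearance, concept, action, audio) are fused and projected into a shared space, a sentence encoder maps text into the same space, and the two are compared there. All dimensions, the concatenation-based fusion, the GRU text encoder, and the hinge-based triplet ranking loss are hypothetical stand-ins, not the paper's actual components.

    # Sketch of a two-branch video-text joint embedding network (PyTorch).
    # Assumptions (not from the paper): feature dimensions, concatenation
    # fusion, a GRU text encoder, cosine similarity, and a triplet ranking
    # loss are all illustrative placeholders.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VideoBranch(nn.Module):
        """Fuses appearance, concept, action, and audio descriptors and
        projects the result into the shared embedding space."""
        def __init__(self, app_dim=2048, con_dim=300, act_dim=1024,
                     aud_dim=128, embed_dim=512):
            super().__init__()
            self.proj = nn.Linear(app_dim + con_dim + act_dim + aud_dim,
                                  embed_dim)

        def forward(self, app, con, act, aud):
            fused = torch.cat([app, con, act, aud], dim=-1)  # simple fusion
            return F.normalize(self.proj(fused), dim=-1)     # unit length

    class TextBranch(nn.Module):
        """Encodes a tokenized sentence and projects it into the same space."""
        def __init__(self, vocab_size=10000, word_dim=300, embed_dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, word_dim)
            self.rnn = nn.GRU(word_dim, embed_dim, batch_first=True)
            self.proj = nn.Linear(embed_dim, embed_dim)

        def forward(self, tokens):
            _, h = self.rnn(self.embed(tokens))   # final hidden state
            return F.normalize(self.proj(h.squeeze(0)), dim=-1)

    def triplet_ranking_loss(v, t, margin=0.2):
        """Hinge-based ranking loss over all in-batch negatives; matched
        video-caption pairs sit on the diagonal of the similarity matrix."""
        sim = v @ t.t()                           # cosine similarities
        pos = sim.diag().unsqueeze(1)
        cost_t = (margin + sim - pos).clamp(min=0)      # caption retrieval
        cost_v = (margin + sim - pos.t()).clamp(min=0)  # video retrieval
        mask = torch.eye(sim.size(0), dtype=torch.bool)
        return (cost_t.masked_fill(mask, 0).sum()
                + cost_v.masked_fill(mask, 0).sum())

In this pattern, retrieval amounts to ranking cosine similarities between the normalized video and sentence embeddings; the paper's actual fusion scheme, encoders, and training objective may differ from this sketch.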