Shahed University

Video captioning using boosted and parallel Long Short-Term Memory networks

Masoomeh Nabati | Alireza Behrad

URL :   http://research.shahed.ac.ir/WSR/WebPages/Report/PaperView.aspx?PaperID=127043
Date :  2019/10/11
Published in :  Computer Vision and Image Understanding
DOI :  https://doi.org/10.1016/j.cviu.2019.102840
Link :  https://www.sciencedirect.com/science/article/pii/S1077314218301632?dgcid=rss_sd_all
Keywords :  Video captioning, Boosted and parallel LSTMs, AdaBoost algorithm

Abstract :
Video captioning and its integration with deep learning is one of the most challenging problems in the fields of machine vision and artificial intelligence. In this paper, a new boosted and parallel architecture is proposed for video captioning using Long Short-Term Memory (LSTM) networks. The proposed architecture comprises two LSTM layers and a word selection module. The first LSTM layer is responsible for encoding the frame features extracted by a pre-trained deep Convolutional Neural Network (CNN). The second LSTM layer uses a novel architecture for caption generation that leverages several decoding LSTMs in a parallel, boosted configuration. This layer, called the Boosted and Parallel LSTM (BP-LSTM) layer, is constructed by iteratively training LSTM networks with a variant of the AdaBoost algorithm during the training phase. During the testing phase, the outputs of the BP-LSTMs are combined concurrently using a maximum-probability criterion and the word selection module. We tested the proposed algorithm on two well-known video captioning datasets and compared the results with state-of-the-art algorithms. The results show that the proposed architecture considerably improves the accuracy of the generated sentences.
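
To make the decoding scheme concrete, below is a minimal PyTorch sketch of the parallel-decoder inference step. It assumes an encoder LSTM over pre-extracted CNN frame features and a small pool of decoder LSTMs whose per-word distributions are merged by taking the elementwise maximum probability, as the abstract describes. All module names, layer sizes, token ids, and the greedy decoding loop are illustrative assumptions; the paper's exact BP-LSTM architecture, its AdaBoost-based training procedure, and its word selection module are not specified in the abstract and are not reproduced here.

import torch
import torch.nn as nn

class BPLSTMCaptioner(nn.Module):
    """Hypothetical sketch of encoder + parallel decoders with max-probability fusion."""

    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, n_decoders=3):
        super().__init__()
        # First LSTM layer: encodes per-frame CNN features.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Second layer: several decoder LSTMs in parallel (the "BP-LSTM" layer).
        self.decoders = nn.ModuleList(
            [nn.LSTMCell(hidden_dim, hidden_dim) for _ in range(n_decoders)]
        )
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.out = nn.ModuleList(
            [nn.Linear(hidden_dim, vocab_size) for _ in range(n_decoders)]
        )

    @torch.no_grad()
    def generate(self, frame_feats, bos_id=1, eos_id=2, max_len=20):
        # frame_feats: (1, T, feat_dim) CNN features of the video frames.
        _, (h, c) = self.encoder(frame_feats)
        # Each decoder starts from the encoder's final state.
        states = [(h[-1].clone(), c[-1].clone()) for _ in self.decoders]
        word = torch.tensor([bos_id])
        caption = []
        for _ in range(max_len):
            x = self.embed(word)
            probs = []
            for i, dec in enumerate(self.decoders):
                states[i] = dec(x, states[i])
                probs.append(torch.softmax(self.out[i](states[i][0]), dim=-1))
            # Word selection: fuse the decoders by taking, per vocabulary entry,
            # the maximum probability across the pool (the abstract's
            # "maximum probability criterion"), then pick the best word.
            stacked = torch.stack(probs)          # (n_decoders, 1, vocab)
            fused = stacked.max(dim=0).values     # elementwise max over decoders
            word = fused.argmax(dim=-1)
            if word.item() == eos_id:
                break
            caption.append(word.item())
        return caption

# Example: caption a dummy 16-frame video with random features.
model = BPLSTMCaptioner()
print(model.generate(torch.randn(1, 16, 2048)))

Note that in the actual method the decoder pool is built iteratively by a boosting procedure during training; this sketch only illustrates how the outputs of parallel decoders can be fused at inference time.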