Abstract:
With convolutional neural networks becoming increasingly popular over the last decade or so,
thanks to easy and affordable access to high computation power, it has become easy to process
huge amounts of data with them.
Machines are getting smarter and smarter, with capabilities in different domains. For example,
in Natural Language Processing, machines are now able to perform automatic language translation,
voice question-answering, abstract generation from a given text, and other advanced processing.
One such problem is caption generation for images, in which the machine has to generate a one-line
description for the input image. Various approaches try to solve this problem with the use of
LSTMs and attention networks.
In this project we propose a novel approach to solve the problem with the help of Deep Reinforce-
ment Learning. Our approach is based on the intuition of how the caption generation task is
performed in real life by a human language expert. A human expert will look at different parts of
the image to pick out the keywords that should be present in the description of that image, and
then interpret the global information in the image to form the caption using the already shortlisted
keywords. Hence we want to look at the local as well as the global aspects of the image while
forming the caption. Our model tries to mimic this approach: the attention network and the
LSTM network are used to capture local information, while the reinforcement learning framework
incorporates the global information into the process.
In this project we train a framework to correctly highlight the important parts of the image,
first for a classification task, which can then be used in the caption generation task. The input
image is fed into a CNN, and the features extracted from it are given to the attention network,
which decides which features to select based on attention scores. Attention is a mechanism by
which a network can weigh features by their level of importance to a task, and use this weighting
to help achieve the task of predicting the description for the image.
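The attention weighting described above can be sketched as follows. This is a minimal sketch, not our exact model: the projection parameters `W_f`, `W_h`, and `v` are hypothetical stand-ins for learned weights, and additive (Bahdanau-style) scoring is assumed as one common choice.

```python
import numpy as np

def soft_attention(features, hidden, W_f, W_h, v):
    """Soft attention over CNN region features.

    features: (L, D) -- L region feature vectors extracted by the CNN
    hidden:   (H,)   -- current decoder (LSTM) hidden state
    W_f, W_h, v      -- hypothetical learned projection parameters
    Returns the attended context vector (D,) and the weights (L,).
    """
    # Score each image region against the hidden state (additive attention)
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v   # (L,)
    # Softmax turns scores into importance weights that sum to 1
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector = importance-weighted sum of region features
    context = weights @ features                          # (D,)
    return context, weights
```

The softmax ensures the weights form a distribution over regions, so the context vector emphasizes the regions the network currently considers most relevant.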
We will use the MSCOCO dataset for this experiment, both for the image caption generation task
and for training our model.
In the experiment, the input image is passed through a CNN to extract features, which are then
passed through an attention network at each time step to obtain the weighted attended features.
These, together with the caption predicted so far, are used as input to the policy network to obtain
the probabilities for selecting the next word. The scores for the candidate words, computed as the
reward in the reinforcement learning framework, are used to select the right next word. In this
way more accurate predictions are generated, since the LSTM takes care of the local information
and the reinforcement network takes care of the global information in the image.
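This per-step decoding loop can be sketched as follows. Here `policy_step` is a hypothetical callable standing in for the combined attention, LSTM, and policy network: it maps the image features and the caption so far to a vector of logits over the vocabulary, from which the next word is sampled.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def generate_caption(image_features, policy_step, vocab, max_len=16):
    """One decoding episode of the captioning policy.

    At each time step the policy network produces a distribution over
    the vocabulary given the attended image features and the partial
    caption; the next word is sampled from that distribution until an
    "<end>" token is produced or max_len is reached.
    """
    caption = ["<start>"]
    for _ in range(max_len):
        logits = policy_step(image_features, caption)  # (V,) logits
        probs = softmax(logits)
        word = vocab[rng.choice(len(vocab), p=probs)]
        caption.append(word)
        if word == "<end>":
            break
    return caption[1:]  # drop the "<start>" token
```

Sampling from the policy (rather than greedy argmax) is what allows a reward signal computed on the completed caption to drive learning, e.g. via a policy-gradient update.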