Abstract:
With convolutional neural networks becoming increasingly popular over the last decade or so,
thanks to easy and affordable access to high computation power, it has become easy to process
huge amounts of data with them.
Machines are getting smarter and smarter, with capabilities in different domains. For example,
in Natural Language Processing, machines are now able to perform automatic language translation,
voice question-answering, abstract generation from a given text, and other advanced processing.
One such problem is caption generation for images, in which the machine has to generate a one-line
description for the input image. Various approaches try to solve this problem with the use of
LSTMs and attention networks.
In this project we propose a novel approach to solve the problem with the help of Deep Reinforce-
ment Learning. Our approach is based on the intuition of how the caption generation task is
performed in real life by a human language expert. A human expert will look at different parts of
the image to pick out the keywords that should be present in the description of that image, and
then interpret the global information in the image to form the caption using the already shortlisted
keywords. Hence we want to look at the local as well as the global aspects of the image while
forming the caption. Our model tries to mimic this approach: the attention network and the
LSTM network are used to capture local information, while the reinforcement learning framework
incorporates the global information into the process.
In this project we train a framework to correctly highlight the important parts of the image,
first for a classification task, which can then be used in the caption generation task. The input
image is fed into a CNN, and the features extracted from it are given to the attention network,
which decides which features to select based on attention scores. Attention is a mechanism by
which a network can weigh features by their level of importance to a task, and use this weighting
to help achieve the task of predicting the description for the image.
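The attention weighting described above can be sketched as follows. This is a minimal sketch, not our exact model: the projection parameters `W_f`, `W_h`, and `v` are hypothetical stand-ins for learned weights, and additive (Bahdanau-style) scoring is assumed as one common choice.

```python
import numpy as np

def soft_attention(features, hidden, W_f, W_h, v):
    """Soft attention over CNN region features.

    features: (L, D) -- L region feature vectors extracted by the CNN
    hidden:   (H,)   -- current decoder (LSTM) hidden state
    W_f, W_h, v      -- hypothetical learned projection parameters
    Returns the attended context vector (D,) and the weights (L,).
    """
    # Score each image region against the hidden state (additive attention)
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v   # (L,)
    # Softmax turns scores into importance weights that sum to 1
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector = importance-weighted sum of region features
    context = weights @ features                          # (D,)
    return context, weights
```

The softmax ensures the weights form a distribution over regions, so the context vector emphasizes the regions the network currently considers most relevant.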
We will use the MSCOCO dataset for this experiment, both for the image caption generation task
and for training our model.
In the experiment, the input image is passed through a CNN to extract features, which are then
passed through an attention network at each time step to obtain the weighted attended features.
These, together with the caption predicted so far, are used as input to the policy network to obtain
the probabilities for selecting the next word. The scores for the candidate words, computed as the
reward in the reinforcement learning framework, are used to select the right next word. In this
way more accurate predictions are generated, since the LSTM takes care of the local information
and the reinforcement network takes care of the global information in the image.
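This per-step decoding loop can be sketched as follows. Here `policy_step` is a hypothetical callable standing in for the combined attention, LSTM, and policy network: it maps the image features and the caption so far to a vector of logits over the vocabulary, from which the next word is sampled.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def generate_caption(image_features, policy_step, vocab, max_len=16):
    """One decoding episode of the captioning policy.

    At each time step the policy network produces a distribution over
    the vocabulary given the attended image features and the partial
    caption; the next word is sampled from that distribution until an
    "<end>" token is produced or max_len is reached.
    """
    caption = ["<start>"]
    for _ in range(max_len):
        logits = policy_step(image_features, caption)  # (V,) logits
        probs = softmax(logits)
        word = vocab[rng.choice(len(vocab), p=probs)]
        caption.append(word)
        if word == "<end>":
            break
    return caption[1:]  # drop the "<start>" token
```

Sampling from the policy (rather than greedy argmax) is what allows a reward signal computed on the completed caption to drive learning, e.g. via a policy-gradient update.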