Image caption generation using Deep Q-learning framework

dc.contributor.author Rana, Dipen Ganpat
dc.date.accessioned 2022-03-22T10:39:05Z
dc.date.available 2022-03-22T10:39:05Z
dc.date.issued 2021-07
dc.identifier.citation 31p. en_US
dc.identifier.uri http://hdl.handle.net/10263/7296
dc.description Dissertation under the supervision of Professor Rajat K De en_US
dc.description.abstract With convolutional neural networks becoming increasingly popular over the last decade, thanks to easy and affordable access to high computational power, it has become feasible to process huge amounts of data through them. Machines are acquiring capabilities in different domains; in natural language processing, for example, they can now perform automatic language translation, spoken question answering, abstract writing from a given text, and other advanced tasks. One such problem is caption generation for images, in which the machine has to generate a one-line description of an input image. Various approaches attempt to solve this problem using LSTMs and attention networks. In this project we propose a novel approach that solves the problem with the help of deep reinforcement learning. Our approach is based on the intuition of how a human language expert performs the caption generation task in real life: the expert looks at different parts of the image to collect keywords that should appear in the description of that image, and then interprets the global information in the image to form the caption using the shortlisted keywords. We therefore want to consider local as well as global aspects of the image while forming the caption. Our model mimics this approach: an attention network and an LSTM network capture local information, while the reinforcement learning framework incorporates global information into the process. We train the framework to correctly highlight the important parts of the image for a classification task, which can then be reused for caption generation. The input image is fed into a CNN, and the extracted features are given to the attention network, which decides which features to select based on attention scores. Attention is a mechanism by which a network weighs features by their level of importance to a task and uses this weighting to help predict the description of the image. We use the MSCOCO dataset for the image caption generation task and for training our model. In the experiment, the input image is passed through a CNN to obtain features; at each time step these are passed through the attention network to obtain weighted, attended features, which, together with the caption predicted so far, are given as input to the policy network to obtain probabilities for selecting the next word. The scores for the candidate words, computed as rewards in the reinforcement learning framework, are then used to select the right next word. In this way more accurate predictions are generated, since the LSTM takes care of the local information and the reinforcement learning network takes care of the global information in the image. en_US
dc.language.iso en en_US
dc.publisher Indian Statistical Institute, Kolkata en_US
dc.relation.ispartofseries Dissertation;CS1901
dc.subject Image caption generation en_US
dc.subject Deep Q-learning framework en_US
dc.subject Long Short Term Memory Networks en_US
dc.subject Attention network en_US
dc.subject Gated Recurrent Unit en_US
dc.title Image caption generation using Deep Q-learning framework en_US
dc.type Other en_US
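
The abstract outlines a concrete decoding pipeline: CNN features, an attention network that weights them, and an LSTM whose state drives next-word scores. The following is a minimal PyTorch sketch of one such step, not the dissertation's actual code; all module names, dimensions, and the toy input are illustrative assumptions.

# Minimal sketch (not the author's code) of one decoding step as described
# in the abstract: CNN features -> attention weights -> attended feature ->
# LSTM state -> Q-values over the vocabulary. All names, sizes, and the
# toy input below are illustrative assumptions.
import torch
import torch.nn as nn

class AttentionCaptioner(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256, vocab_size=10000, embed_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Additive attention over the CNN's spatial feature vectors.
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        # The LSTM consumes [previous word embedding; attended feature].
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        # Output head read as Q-values, one per vocabulary word.
        self.q_head = nn.Linear(hidden_dim, vocab_size)

    def step(self, feats, prev_word, h, c):
        # feats: (B, regions, feat_dim); prev_word: (B,); h, c: (B, hidden_dim)
        scores = self.att_score(torch.tanh(
            self.att_feat(feats) + self.att_hid(h).unsqueeze(1))).squeeze(-1)
        alpha = torch.softmax(scores, dim=1)            # attention scores
        context = (alpha.unsqueeze(-1) * feats).sum(1)  # weighted attended feature
        h, c = self.lstm(torch.cat([self.embed(prev_word), context], dim=1), (h, c))
        return self.q_head(h), h, c

model = AttentionCaptioner()
feats = torch.randn(1, 49, 512)           # e.g. a flattened 7x7 CNN feature map
prev = torch.zeros(1, dtype=torch.long)   # assumed <start> token id
h = c = torch.zeros(1, 256)
q, h, c = model.step(feats, prev, h, c)
next_word = q.argmax(dim=1)               # pick the highest-scoring word

The policy-network probabilities mentioned in the abstract would correspond to a softmax over these per-word scores; here the same head is read directly as Q-values for the next-word action.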
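
The abstract also describes selecting the next word from reward-based scores, which is where the Deep Q-learning framework of the title enters. Below is a sketch of a one-step Q-learning update under assumed details: the per-step reward (e.g. the gain in a caption-quality metric after appending a word), the discount factor, and the greedy action choice are all illustrative assumptions, not the dissertation's specification.

import torch
import torch.nn.functional as F

def q_update(q_t, action, reward, q_next, optimizer, gamma=0.9):
    # Bellman target r + gamma * max_a' Q(s', a'); no gradient flows through it.
    target = reward + gamma * q_next.detach().max(dim=1).values
    pred = q_t.gather(1, action.unsqueeze(1)).squeeze(1)  # Q(s, chosen word)
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Hypothetical usage with the AttentionCaptioner sketched above:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
q, h, c = model.step(feats, prev, h, c)
action = q.argmax(dim=1)                 # greedy choice; epsilon-greedy while training
reward = torch.tensor([0.1])             # stand-in for a caption-metric gain
q_next, _, _ = model.step(feats, action, h, c)
q_update(q, action, reward, q_next, optimizer)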

