Image caption generation using Deep Q-learning framework

dc.contributor.author Rana, Dipen Ganpat
dc.date.accessioned 2022-03-22T10:39:05Z
dc.date.available 2022-03-22T10:39:05Z
dc.date.issued 2021-07
dc.identifier.citation 31p. en_US
dc.identifier.uri http://hdl.handle.net/10263/7296
dc.description Dissertation under the supervision of Professor Rajat K De en_US
dc.description.abstract With convolutional neural networks becoming increasingly popular over the last decade, thanks to easy and affordable access to high computational power, it has become feasible to process huge amounts of data through them. Machines are acquiring capabilities in different domains; in natural language processing, for example, they can now perform automatic language translation, spoken question answering, abstract writing from a given text, and other advanced tasks. One such problem is caption generation for images, in which the machine has to generate a one-line description of an input image. Various approaches attempt to solve this problem using LSTMs and attention networks. In this project we propose a novel approach that solves the problem with the help of deep reinforcement learning. Our approach is based on the intuition of how a human language expert performs the caption generation task in real life: the expert looks at different parts of the image to collect keywords that should appear in the description of that image, and then interprets the global information in the image to form the caption using the shortlisted keywords. We therefore want to consider local as well as global aspects of the image while forming the caption. Our model mimics this approach: an attention network and an LSTM network capture local information, while the reinforcement learning framework incorporates global information into the process. We train the framework to correctly highlight the important parts of the image for a classification task, which can then be reused for caption generation. The input image is fed into a CNN, and the extracted features are given to the attention network, which decides which features to select based on attention scores. Attention is a mechanism by which a network weighs features by their level of importance to a task and uses this weighting to help predict the description of the image. We use the MSCOCO dataset for the image caption generation task and for training our model. In the experiment, the input image is passed through a CNN to obtain features; at each time step these are passed through the attention network to obtain weighted, attended features, which, together with the caption predicted so far, are given as input to the policy network to obtain probabilities for selecting the next word. The scores for the candidate words, computed as rewards in the reinforcement learning framework, are then used to select the right next word. In this way more accurate predictions are generated, since the LSTM takes care of the local information and the reinforcement learning network takes care of the global information in the image. en_US
dc.language.iso en en_US
dc.publisher Indian Statistical Institute, Kolkata en_US
dc.relation.ispartofseries Dissertation;CS1901
dc.subject Image caption generation en_US
dc.subject Deep Q-learning framework en_US
dc.subject Long Short Term Memory Networks en_US
dc.subject Attention network en_US
dc.subject Gated Recurrent Unit en_US
dc.title Image caption generation using Deep Q-learning framework en_US
dc.type Other en_US
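
The abstract outlines a concrete decoding pipeline: CNN features, an attention network that weights them, and an LSTM whose state drives next-word scores. The following is a minimal PyTorch sketch of one such step, not the dissertation's actual code; all module names, dimensions, and the toy input are illustrative assumptions.

# Minimal sketch (not the author's code) of one decoding step as described
# in the abstract: CNN features -> attention weights -> attended feature ->
# LSTM state -> Q-values over the vocabulary. All names, sizes, and the
# toy input below are illustrative assumptions.
import torch
import torch.nn as nn

class AttentionCaptioner(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256, vocab_size=10000, embed_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Additive attention over the CNN's spatial feature vectors.
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        # The LSTM consumes [previous word embedding; attended feature].
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        # Output head read as Q-values, one per vocabulary word.
        self.q_head = nn.Linear(hidden_dim, vocab_size)

    def step(self, feats, prev_word, h, c):
        # feats: (B, regions, feat_dim); prev_word: (B,); h, c: (B, hidden_dim)
        scores = self.att_score(torch.tanh(
            self.att_feat(feats) + self.att_hid(h).unsqueeze(1))).squeeze(-1)
        alpha = torch.softmax(scores, dim=1)            # attention scores
        context = (alpha.unsqueeze(-1) * feats).sum(1)  # weighted attended feature
        h, c = self.lstm(torch.cat([self.embed(prev_word), context], dim=1), (h, c))
        return self.q_head(h), h, c

model = AttentionCaptioner()
feats = torch.randn(1, 49, 512)           # e.g. a flattened 7x7 CNN feature map
prev = torch.zeros(1, dtype=torch.long)   # assumed <start> token id
h = c = torch.zeros(1, 256)
q, h, c = model.step(feats, prev, h, c)
next_word = q.argmax(dim=1)               # pick the highest-scoring word

The policy-network probabilities mentioned in the abstract would correspond to a softmax over these per-word scores; here the same head is read directly as Q-values for the next-word action.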
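
The abstract also describes selecting the next word from reward-based scores, which is where the Deep Q-learning framework of the title enters. Below is a sketch of a one-step Q-learning update under assumed details: the per-step reward (e.g. the gain in a caption-quality metric after appending a word), the discount factor, and the greedy action choice are all illustrative assumptions, not the dissertation's specification.

import torch
import torch.nn.functional as F

def q_update(q_t, action, reward, q_next, optimizer, gamma=0.9):
    # Bellman target r + gamma * max_a' Q(s', a'); no gradient flows through it.
    target = reward + gamma * q_next.detach().max(dim=1).values
    pred = q_t.gather(1, action.unsqueeze(1)).squeeze(1)  # Q(s, chosen word)
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Hypothetical usage with the AttentionCaptioner sketched above:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
q, h, c = model.step(feats, prev, h, c)
action = q.argmax(dim=1)                 # greedy choice; epsilon-greedy while training
reward = torch.tensor([0.1])             # stand-in for a caption-metric gain
q_next, _, _ = model.step(feats, action, h, c)
q_update(q, action, reward, q_next, optimizer)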

