Abstract:
In recent years, tremendous progress has been made in the fields of object
detection, computer vision, and natural language processing. Artificial intelligence
Systems (AI), such as question-answering models provide the machine with
"comprehensive" capabilities using natural language processing. Such a machine
can respond to queries in natural language about an unstructured text. For performing
the task of VQA, we can combine Natural language processing with computer
vision.The purpose of a visual question answering system is to create a system
capable of answering natural language queries about images. A number of
systems have been introduced for visual question answering that use learning algorithms
and deep-learning architectures.
This project introduces a VQA system that uses deep understanding of images using
a deep convolutional neural network (CNN) that helps to extract features from
image and LSTM are used for word embeddings for question texts.in this project
we are taking only those questions that have answer type yes or no. Hence, Our
system achieves complex reasoning and natural language understanding so that
it can correctly predict the request and give the appropriate answer yes or no. Different
architectures are introduced to combine the image and language models.