There’s a Microwave in the Hallway

Emampoor, Yasmeen

Abstract

Embodied Question Answering (EQA) is a task in which an agent situated in a virtual environment navigates from its current position to an object (Navigation) and then answers a question about it (Visual Question Answering, VQA), for example “What color is the table in the kitchen?” This project examines how an agent modelled as a deep neural network uses semantic information from its language model and visual information from the scene to answer questions in the second task. This matters because the task and the dataset are highly regular, so the model could be answering questions purely from general semantic information in its language model (tables are frequently brown) without relying on the visual scene, a phenomenon commonly known as hallucinating. The project first examines the quality of the current task dataset, EQA-MP3D, and presents a series of experiments in which the visual information given to the model is manipulated or corrupted. Next, the model is extended with new sources of information, with the expectation that it will use them to improve the grounding of questions and answers in perception. Structured information, in the form of identified object regions, is found to be particularly helpful. Additionally, we examine the impact of question type on performance. The dataset includes three distinct question types: color, color room, and location. Baseline performance differs across these types, and changes to the input also affect performance differently depending on question type.

Degree level

Student essay

Date

2022-04-20

Author

Emampoor, Yasmeen

Keywords

embodied question answering

visual question answering

multi-modality

information fusion

Language

eng
