Title: FINDING MEANING IN A HAYSTACK: On How Vision and Language Models Process Figurative Language
Author: Filippatou, Viktoria
Date: 2024-11-28
URI: https://hdl.handle.net/2077/84376
Language: eng
Type: Text
Keywords: figurative language, vision, language, VisualBERT

Abstract: Figurative language is an integral part of human communication and everyday life. As a Natural Language Processing task, it has long been a focus of research, and it has recently been recast as a vision and language task, where multi-modal models seem to outperform uni-modal ones. This thesis explores how a transformer-based vision and language model, specifically VisualBERT, understands figurative language (idioms, metaphors, and similes) and examines whether its visual embeddings can be enhanced to align better with figurative meaning. Understanding these alignments is critical for assessing whether such models can truly grasp the abstract and symbolic layers of language, beyond surface-level pattern recognition. Through a series of experiments and attention analysis, this research highlights both the potential and the limitations of a vision and language model, illuminating the broader challenges in grounding language in visual contexts.