dc.description.abstract | Visual Relationship Detection (VRD) is a relatively young research area, where the
goal is to develop prediction models for detecting the relationships between objects
depicted in an image. A relationship is modeled as a subject-predicate-object triplet,
where the predicate (e.g. an action or a spatial relation, such as “eat”, “chase”
or “next to”) describes how the subject and the object interact in the given
image. VRD can be formulated as a classification problem, but suffers from the
effects of having a combinatorial output space; some of the major issues to overcome
are long-tailed class distributions, class overlap and intra-class variance. Machine
learning models have been found effective for the task and, more specifically, many
works have shown that combining visual, spatial and semantic features from the detected
objects is key to achieving good predictions. This work investigates the use of
distributional embeddings, often used to discover and encode semantic information, to
improve the results of an existing neural network-based architecture for
VRD. Experiments are performed to make the model semantically aware
of the classification output domain, namely the predicate classes. Additionally, different
word embedding models are trained from scratch to better account for multi-word
objects and predicates, and are then fine-tuned on VRD-related text corpora.
We evaluate our methods on two datasets. Ultimately, we show that, for some set of
predicate classes, semantic knowledge of the predicates exported from trained-from-scratch
distributional embeddings can be leveraged to greatly improve prediction,
and that this is especially effective for zero-shot learning. | sv |