Dealing with word ambiguity in NLP. Building appropriate sense representations for Danish sense tagging by combining word embeddings with wordnet senses

Rørmann Olsen, Ida

Abstract

This thesis describes an approach to handling word senses in natural language processing. For language technologies to cope with word ambiguity, machines need proper sense representations. In a case study on ambiguous Danish nouns, we examined whether an appropriate sense inventory can be built by combining the distributional information about a word from a vector space model with knowledge-based information from a wordnet. We tested three sense representations in a word sense disambiguation task: first, the centroids (averages of word vectors) of selected wordnet synset information and synset members; second, the centroids of the wordnet sample sentences alone; and third, the centroids of unlabelled sample sentences clustered around the wordnet sample sentences. Finally, we used the features of the cluster members and of the evaluation data to train and test supervised machine learning classifiers. In all experiments the sense representations generally beat the random baseline significantly, but not the most-frequent-sense baseline. The representations built from selected wordnet synset information and synset members generally gave the best results, especially for target words with rich synset information. The machine learning classifiers significantly outperformed the sense representations on the word sense disambiguation task. The best classifiers were those trained and tested on either the clustered data or the evaluation data. We conclude that combining word embeddings with wordnet-associated data to build proper sense representations seems promising. However, we suggest some improvements for future work, specifically regarding the information extracted from wordnet sample sentences.
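The centroid-based disambiguation described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the 3-dimensional vectors, the English example words, the sense labels, and the function names `centroid`, `cosine`, and `disambiguate` are all hypothetical, and cosine similarity is assumed as the comparison measure. A real setup would load vectors from a trained word2vec model and take the sense words from a wordnet.

```python
import math

# Toy embeddings with made-up values, for illustration only.
EMB = {
    "money": [1.0, 0.1, 0.0],
    "loan":  [0.9, 0.2, 0.1],
    "pay":   [0.8, 0.0, 0.1],
    "river": [0.0, 1.0, 0.1],
    "shore": [0.1, 0.9, 0.2],
    "water": [0.0, 0.8, 0.3],
}

def centroid(words, emb=EMB):
    """Average the vectors of all words that have an embedding."""
    vectors = [emb[w] for w in words if w in emb]
    if not vectors:
        raise ValueError("no known words to average")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def disambiguate(context_words, sense_centroids):
    """Return the sense whose centroid is closest to the context centroid."""
    ctx = centroid(context_words)
    return max(sense_centroids, key=lambda s: cosine(ctx, sense_centroids[s]))

# Hypothetical sense inventory for an ambiguous noun such as "bank":
# each sense is represented by the centroid of words associated with it.
SENSES = {
    "bank_finance": centroid(["money", "loan"]),
    "bank_river": centroid(["river", "shore"]),
}

predicted = disambiguate(["pay", "loan"], SENSES)
print(predicted)
```

Here each sense is a single averaged vector, so disambiguation reduces to a nearest-centroid lookup over the context words.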

Degree level

Student essay

Date

2018-12-13

Author

Rørmann Olsen, Ida

Keywords

sense embeddings

wordnet

word2vec

word sense disambiguation

clustering

machine learning

supervised WSD

Publication type

Language

eng
