Machine Learning for Detecting Hate Speech in Low Resource Languages

Rodriguez, David; Saynova, Denitsa

Abstract

This work examines the role of cross-lingual zero-shot learning and data augmentation in detecting online hate speech in low-resource settings. When labeled data is scarce, the proposed solutions are to train on a language with more resources or to create synthetic data points. Cross-lingual zero-shot results suggest that some knowledge transfer occurs, but they appear to depend heavily on the specific training data set selected. Cross-data-set experiments within a single language support this: results also fluctuated with the training data, even when no cross-lingual transfer was involved. Data augmentation methods, meanwhile, show an improvement, especially when little data is available. The work also presents a detailed discussion of how the proposed data augmentation techniques affect the data.

Level of examination

Student essay

Date

2020-07-08

Authors

Rodriguez, David

Saynova, Denitsa

Keywords

machine learning

natural language processing

BERT

cross-lingual zero-shot learning

data augmentation

hate speech

classification

Twitter

Series/report no.

CSE 20-16

Language

eng
