Exploit Unlabeled Data with Language Model for Text Classification. Comparison of four unsupervised learning models
Abstract
In a setting where Semi-Supervised Learning (SSL) is available to exploit unlabeled data, this paper shows that a Language Model (LM) outperforms three other unsupervised models in text classification: one based on Term Frequency-Inverse Document Frequency (Tf-idf) and two based on pre-trained word vectors. The experimental results show that the LM outperforms these models whether the task is easy or difficult, where the difficult task involves imbalanced data.
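As a concrete illustration of the compared setup, the sketch below shows one of the baseline models: a Tf-idf representation fit on all training text without labels (the unsupervised step) and paired with a classifier trained on a small labeled subset. The dataset, classifier, and sizes are illustrative assumptions, not the thesis's actual configuration.

```python
# Hedged sketch (not the thesis code): a Tf-idf baseline trained on a
# small labeled subset, as one of the four compared unsupervised models.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

train = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])
test = fetch_20newsgroups(subset="test", categories=["sci.med", "sci.space"])

# Simulate the small-labeled-data regime: keep only 100 labeled examples.
X_small, y_small = train.data[:100], train.target[:100]

# Fit the vectorizer on ALL training text (labels unused) -- this is the
# unsupervised step that exploits the unlabeled data.
vectorizer = TfidfVectorizer(max_features=20000)
vectorizer.fit(train.data)

clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.transform(X_small), y_small)

pred = clf.predict(vectorizer.transform(test.data))
print("F1:", f1_score(test.target, pred))
```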
To investigate not only how the LM outperforms the other models but also how to maximize its performance when only a small quantity of labeled data is available, this paper suggests two techniques for improving the LM in neural networks: (1) extracting information from the neural network's intermediate layers and (2) employing a proper evaluation procedure for trained neural network models.
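A minimal sketch of technique (1) follows, under the assumption that "information from the neural network layers" means reusing an LM's intermediate hidden states as fixed-size document features for the downstream classifier. The LSTM architecture, dimensions, and method names are illustrative stand-ins, not the thesis's model.

```python
# Sketch of technique (1): extract intermediate-layer representations
# from a (hypothetical) LSTM language model to use as document features.
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)  # next-word head

    def forward(self, tokens):
        hidden_states, _ = self.lstm(self.embed(tokens))
        return self.decoder(hidden_states)

    def extract_features(self, tokens):
        # Reuse the intermediate layer: mean-pool the hidden states into a
        # fixed-size document vector, ignoring the next-word output head.
        with torch.no_grad():
            hidden_states, _ = self.lstm(self.embed(tokens))
        return hidden_states.mean(dim=1)

lm = LSTMLanguageModel(vocab_size=10000)
batch = torch.randint(0, 10000, (4, 32))   # 4 documents, 32 tokens each
features = lm.extract_features(batch)      # shape: (4, 256)
print(features.shape)
```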
Finally, this paper explores scenarios where SSL is not available and only Transfer Learning (TL) is accessible to exploit unlabeled data. Using two types of TL, Self-Taught Learning and Multi-Task learning, the experiments show that exploiting a dataset with a wider domain benefits the performance of the LM.
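A hedged sketch of the Self-Taught Learning variant of TL, assuming it means pretraining the LM on unlabeled, wider-domain text with a next-token objective and then transferring the pretrained encoder to the labeled classification task. All data, sizes, and variable names below are placeholders.

```python
# Sketch of Self-Taught Learning as one TL setup: pretrain an LM on
# unlabeled wide-domain text, then reuse its encoder for classification.
import torch
import torch.nn as nn

vocab, emb, hid, n_classes = 10000, 128, 256, 2
embed = nn.Embedding(vocab, emb)
lstm = nn.LSTM(emb, hid, batch_first=True)
lm_head = nn.Linear(hid, vocab)

# --- Stage 1: unsupervised pretraining on unlabeled (wide-domain) text ---
tokens = torch.randint(0, vocab, (8, 33))          # stand-in for a real corpus
inputs, targets = tokens[:, :-1], tokens[:, 1:]    # predict the next token
opt = torch.optim.Adam([*embed.parameters(),
                        *lstm.parameters(),
                        *lm_head.parameters()])
h, _ = lstm(embed(inputs))
loss = nn.functional.cross_entropy(lm_head(h).reshape(-1, vocab),
                                   targets.reshape(-1))
loss.backward()
opt.step()

# --- Stage 2: transfer the pretrained encoder to the labeled task ---
clf_head = nn.Linear(hid, n_classes)               # only new parameters
labeled = torch.randint(0, vocab, (4, 32))
labels = torch.randint(0, n_classes, (4,))
h, _ = lstm(embed(labeled))                        # reused, pretrained layers
logits = clf_head(h.mean(dim=1))
print(nn.functional.cross_entropy(logits, labels).item())
```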
Degree
Student essay
Date
2018-10-29
Author
Yang, Sung-Min
Keywords
Text classification
Semi-supervised learning
Unsupervised learning
Transfer learning
Natural Language Processing
Publication type
H2
Language
eng