TOPIC MODELING FOR ANALYSIS OF PUBLIC DISCOURSE - Enriching topic modeling with linguistic information to analyze Swedish housing policies
Abstract
This work investigates how topic modeling can be applied to study the public discourse on Swedish housing policies. The data used to represent this discourse comes from both the Swedish
parliament, the Riksdag, and Swedish news texts. The housing shortage and current housing crisis in Sweden
make this a relevant area to study. Topic modeling is an unsupervised probabilistic method for finding topics in large collections of data. It is a popular method for examining public discourse; however,
linguistic information is rarely included in its preprocessing steps. Therefore, this
work also investigates what effect linguistically informed preprocessing has on topic modeling.
Three types of linguistic information are selected and investigated: part of speech, dependency
relations and lemmatization. Based on these, filters are created for the data. The filters are applied to a
test set (a subset of the original data), and a topic model is trained on each filtered version of this test
set. The resulting topics from each model are evaluated both by humans and by two computational measures, perplexity and semantic coherence, and the results of the evaluation methods are compared.
The semantic coherence measure C_v is found to correlate more strongly with human ratings than the NPMI
coherence measure. Perplexity is found not to correlate well with human ratings.
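To make the idea of a coherence measure concrete, the following is a minimal sketch of an NPMI-style score for one topic. It is an illustration under stated assumptions, not the implementation used in this work: it estimates probabilities from document-level co-occurrence (published variants often use sliding windows), and the function name `npmi_coherence` and the toy Swedish documents are hypothetical.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, documents):
    """Average NPMI over all word pairs in a topic.

    Assumption: probabilities are estimated as the fraction of documents
    containing the word(s); this is a simplification for illustration.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every given word.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for a, b in combinations(topic_words, 2):
        p_a, p_b, p_ab = prob(a), prob(b), prob(a, b)
        if p_ab == 0.0:
            # Words never co-occur: use the minimum NPMI value by convention.
            scores.append(-1.0)
        elif p_ab == 1.0:
            # Both words occur in every document: PMI is 0 and the
            # normalizer is 0, so the pair is scored 0 here (an assumption).
            scores.append(0.0)
        else:
            pmi = math.log(p_ab / (p_a * p_b))
            scores.append(pmi / -math.log(p_ab))
    return sum(scores) / len(scores)


# Hypothetical toy corpus, already tokenized.
docs = [
    ["bostad", "hyra", "marknad"],
    ["bostad", "hyra"],
    ["skatt", "budget"],
    ["skatt", "budget", "hyra"],
]
coherent = npmi_coherence(["bostad", "hyra"], docs)
incoherent = npmi_coherence(["bostad", "skatt"], docs)
```

In this toy corpus the pair that co-occurs in documents scores higher than the pair that never does, which is the property a coherence measure is meant to capture.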
Filtering the data by part of speech is found to improve topic quality the most. Non-lemmatized
topics are found to be rated higher than lemmatized topics. Topics from the filters based on dependency
relations are found to receive low ratings.
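A part-of-speech filter with optional lemmatization can be sketched as below. This is a hedged illustration only: it assumes the text has already been tagged and lemmatized by some external tool, it assumes Universal POS tag names such as `NOUN`, and the names `Token` and `filter_tokens` as well as the example sentence are hypothetical rather than taken from this work.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # surface form as it appears in the text
    lemma: str  # dictionary form, assumed to come from an external lemmatizer
    pos: str    # part-of-speech tag, assumed to follow Universal POS tags


def filter_tokens(tokens, keep_pos=frozenset({"NOUN"}), lemmatize=False):
    """Keep only tokens whose POS tag is in keep_pos.

    Returns lowercased lemmas if lemmatize is True, otherwise lowercased
    surface forms. The output is the token list a topic model would see.
    """
    return [
        (t.lemma if lemmatize else t.form).lower()
        for t in tokens
        if t.pos in keep_pos
    ]


# Hypothetical pre-tagged sentence: "Regeringen bygger nya bostäder".
sentence = [
    Token("Regeringen", "regering", "NOUN"),
    Token("bygger", "bygga", "VERB"),
    Token("nya", "ny", "ADJ"),
    Token("bostäder", "bostad", "NOUN"),
]
nouns_only = filter_tokens(sentence)
noun_lemmas = filter_tokens(sentence, lemmatize=True)
nouns_and_verbs = filter_tokens(sentence, keep_pos=frozenset({"NOUN", "VERB"}))
```

Training one topic model per filtered variant, as described above, then amounts to running the same model on each of these token lists.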
Based on the human ratings, an optimal model is chosen for each data set. The selected topic
models are applied to the data, and the results are used to exemplify how they can be used for analysis.
Topic modeling is found to be a suitable method for the intended analysis.
Degree level
Student essay
Date
2018-01-15
Author
Lindahl, Anna
Keywords
topic modeling
public discourse
housing policies
LDA
semantic coherence measures
part of speech
Publication type
H2
Language
eng