Text Analysis - Exploring latent semantic models for information retrieval, topic modeling and sentiment detection
Abstract
With the increasing use of the Internet and social media, the amount of
available data has exploded. As most of this data is natural language text,
there is a need for efficient text analysis techniques which enable extraction
of useful data. This process is called text mining, and in this thesis some of
these techniques are evaluated for the purpose of integrating them into the
visual data mining software TIBCO Spotfire®.
In total, five analysis models with different running time, memory use and
performance have been analyzed, implemented and evaluated. The tf-idf vector
space model was used as a baseline. It can be extended using Latent Semantic
Analysis and random projection to find latent semantic relationships
between documents. Finally, Latent Dirichlet Allocation (LDA), the Joint
Sentiment/Topic model (JST) and Sentiment Latent Dirichlet Allocation (SLDA)
are used to extract topics. The latter two are extensions of LDA that also
detect positive and negative sentiment.
Evaluation was done using the perplexity measure for topic modeling, average
precision for searching and classification accuracy of positive and negative
reviews for the sentiment models. It was concluded that for searching, a
vector space model with tf-idf weighting performed similarly to the latent
semantic models on the test corpus used. Topic modeling was shown to provide
useful output, but at the expense of running time. The
JST and SLDA sentiment detectors showed a small improvement over
a baseline word-counting classifier, especially for a multiple-domain dataset.
Finally, it was shown that their sentiment classification accuracy varied
from run to run, indicating that further investigation is warranted.
Degree
Student essay
Date
2011-12-07
Author
Jalsborn, Erik
Luotonen, Adam
Language
eng