Sequential Anomaly Detection for Log Data Using Deep Learning
Abstract
Software development with continuous integration requires frequent testing to assess
each change. Analyzing the test output manually is time-consuming, and automating
this process could benefit an organization. The goal of this thesis project is
to automate anomaly detection in software test output files provided
by Volvo Group Trucks Technology. To achieve this, we evaluated four neural
network architectures: two recurrent neural networks with long short-term memory
(LSTM), one unidirectional and one bidirectional, and two autoencoders (an LSTM-based
sequence-to-sequence model and a Transformer) that aim to reconstruct a sequence
from the files.
To evaluate the performance of the neural network architectures, two datasets
were used. The first is a publicly available dataset from the Hadoop Distributed
File System (HDFS) in which all logs are labelled as either anomalous
or non-anomalous. The second consists of unlabelled log files resulting from software
testing, provided by Volvo Group Trucks Technology. The networks
were evaluated in two settings when trained on the HDFS data. In the first
setting, the logs labelled as anomalous were filtered out, making it a semi-supervised
approach; in the second setting, the logs labelled as anomalous were kept, making
it an unsupervised approach. Lastly, the networks were trained on the unlabelled data
provided by Volvo Group Trucks Technology to evaluate how they perform in an
unsupervised setting. In addition, an analysis of the size of the datasets used to
train the networks was performed.
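The difference between the two HDFS training settings can be illustrated with a minimal sketch. It is not the thesis implementation: the data representation (a list of event IDs paired with a boolean label), the function name build_training_set, and the toy data are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation (an assumption, not taken from the thesis):
# each log sequence is a list of event IDs paired with a label,
# where True means the sequence is labelled as anomalous.
LabelledSequence = Tuple[List[int], bool]


def build_training_set(data: List[LabelledSequence], setting: str) -> List[List[int]]:
    """Return the sequences used for training under the given setting."""
    if setting == "semi-supervised":
        # Labels are used only to filter out the anomalous sequences.
        return [seq for seq, is_anomalous in data if not is_anomalous]
    if setting == "unsupervised":
        # Labels are ignored, so anomalous sequences stay in the training data.
        return [seq for seq, _ in data]
    raise ValueError(f"unknown setting: {setting}")


# Toy data, invented for this example.
toy_data = [([5, 22, 11], False), ([5, 5, 9], True), ([22, 11, 9], False)]
semi = build_training_set(toy_data, "semi-supervised")
unsup = build_training_set(toy_data, "unsupervised")
```

In both settings the labels are never passed to the network itself; in the semi-supervised setting they only decide which sequences the network is allowed to see during training.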
The results show that, for the data provided by Volvo Group Trucks Technology, the
size of the training dataset influenced anomaly detection performance: networks
trained on a smaller dataset performed better than those trained on a larger one.
For the HDFS dataset, a smaller training dataset was also better than a larger one
in the unsupervised setting. However, for the HDFS data the semi-supervised
approach outperformed the unsupervised approach regardless of the size of the training
dataset.
Degree
Student essay
Collections
Date
2021-06-14
Author
Hammargren, Lina
Wu, Wei
Keywords
anomaly detection, recurrent neural network, long short-term memory, semi-supervised learning, seq2seq, transformer, unsupervised learning, log analysis
Language
eng