Sequential Anomaly Detection for Log Data Using Deep Learning

Hammargren, Lina; Wu, Wei

Abstract

Software development with continuous integration requires frequent testing to assess each change. Analyzing the test output manually is time-consuming, and automating this process could benefit an organization. The goal of this thesis project is to automate anomaly detection in software test output files provided by Volvo Group Trucks Technology. To achieve this, we evaluated four neural network architectures: two recurrent neural networks with long short-term memory (LSTM), one unidirectional and one bidirectional, and two autoencoders (an LSTM-based sequence-to-sequence model and a Transformer) that aim to reconstruct a sequence from the files. Two datasets were used to evaluate the architectures. The first is a publicly available dataset from the Hadoop Distributed File System (HDFS) in which all logs are labelled as either anomalous or non-anomalous. The second consists of unlabelled log files from software testing, provided by Volvo Group Trucks Technology. When trained on the HDFS data, the networks were evaluated in two settings. In the first, the logs labelled as anomalous were filtered out, making it a semi-supervised approach; in the second, the logs labelled as anomalous were kept, making it an unsupervised approach. Lastly, the networks were trained on the unlabelled data provided by Volvo Group Trucks Technology to evaluate how they perform in an unsupervised setting. In addition, the effect of the size of the training dataset was analyzed.
The results show that, for the data provided by Volvo Group Trucks Technology, the size of the training dataset influenced anomaly detection performance: networks trained on a smaller dataset performed better than those trained on a larger one. For the HDFS dataset in the unsupervised setting, a smaller training dataset also gave better results than a larger one. However, for the HDFS data the semi-supervised approach outperformed the unsupervised approach regardless of the size of the training dataset.
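
The difference between the two HDFS training settings can be illustrated with a minimal sketch. It is not the authors' code: the record layout (a list of event ids paired with an anomaly label), the function name `build_training_set`, and the toy data are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical record layout: (sequence of log event ids, anomaly label).
LogSequence = Tuple[List[int], bool]


def build_training_set(sequences: List[LogSequence], setting: str) -> List[List[int]]:
    """Select training sequences for one of the two settings in the abstract."""
    if setting == "semi-supervised":
        # Sequences labelled as anomalous are filtered out before training.
        return [events for events, is_anomalous in sequences if not is_anomalous]
    if setting == "unsupervised":
        # Every sequence is kept; the labels are ignored during training.
        return [events for events, _ in sequences]
    raise ValueError(f"unknown setting: {setting!r}")


# Invented toy data, not taken from the HDFS dataset.
toy_sequences: List[LogSequence] = [
    ([5, 5, 22, 11], False),
    ([5, 22, 9, 9], True),
    ([5, 5, 11, 23], False),
]

semi_supervised_train = build_training_set(toy_sequences, "semi-supervised")
unsupervised_train = build_training_set(toy_sequences, "unsupervised")
```

In both settings the labels are only used, if at all, to decide which sequences enter the training set; the networks themselves are trained without label targets.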

Degree

Student essay

Date

2021-06-14

Author

Hammargren, Lina

Wu, Wei

Keywords

anomaly detection, recurrent neural network, long short-term memory, semi-supervised learning, seq2seq, transformer, unsupervised learning, log analysis

Language

eng
