Bioinformatics pipeline development to support Helicobacter pylori genome analysis
Abstract
Helicobacter pylori is a bacterium associated with a variety of diseases and is a major
risk factor for gastric cancer [1]. There can be differences in the genomes of H.
pylori bacteria that are isolated from different patient groups and this project is
motivated by the desire of biologists to investigate how these differences correlate
with differences in disease. The development of high-throughput technologies in biological
science has led to the emergence of big data. Next-generation sequencing (NGS)
is one such technology: it generates, quickly and at relatively low cost, large text-based
files that store short fragments of an organism's whole-genome sequence [2].
Developing methods for the analysis and management of such
complex datasets has emerged as a challenge, which is the subject of this
thesis.
Using bioinformatics approaches, we proposed a pipeline for data analysis
and management on two different platforms: high-performance computing
and an online workflow management system. For high-performance computing,
we used a computer cluster from C3SE, a centre for scientific and technical
computing at Chalmers University of Technology. On this platform, we developed a
pipeline scripted in the Perl programming language. The first step in the pipeline
is error removal and quality control. The next step finds overlaps between the short
fragments of genome sequence and merges them into longer, continuous sequences.
This is called de novo genome assembly. The final step is genome annotation,
the process of transferring biological information from experimentally characterized
datasets or reference genomes to newly sequenced genomes. For each step in the
pipeline, we used benchmarking to select the best-performing programs developed
by the bioinformatics community and, where no suitable application existed, we
implemented one ourselves. The result of the pipeline is well-characterised, biologically annotated
datasets that are ready for analysis by biologists. For the workflow management
system, we chose a widely used bioinformatics platform, the
Galaxy project. Galaxy provides infrastructure for creating workflows and uploading
datasets via a user interface. With Galaxy, we managed to implement a pipeline
including quality control and de novo genome assembly.
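The overlap-and-merge idea behind de novo assembly can be illustrated with a toy example. The sketch below is not the assembler used in the thesis; it is a minimal greedy merge written in Python for illustration only, and the function names, the `min_overlap` parameter and the example reads are all hypothetical. Production assemblers use far more elaborate graph-based methods and must cope with sequencing errors, repeats and reverse-complement reads, none of which is handled here.

```python
def overlap_length(left, right, min_overlap=3):
    """Return the length of the longest suffix of `left` that is a prefix of `right`.

    Overlaps shorter than `min_overlap` (a hypothetical threshold) count as 0.
    """
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the largest overlap.

    Stops when no remaining pair overlaps by at least `min_overlap` bases.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left to merge
        # Join the two sequences, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


if __name__ == "__main__":
    # Made-up reads for illustration; not data from the study.
    example_reads = ["ATGGCGT", "GCGTACG", "TACGTTA"]
    print(greedy_assemble(example_reads))
```

The pairwise comparison is quadratic in the number of reads, which is acceptable for a handful of toy sequences but not for real sequencing output, which is one reason the work relied on established, benchmarked tools.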
Using the first platform (the high-performance computing pipeline), we succeeded in
analysing 52 datasets. This project is the first study to explore, on a significant
scale, H. pylori and its association with gastric cancer.
Degree
Student essay
Date
2016-09-20
Author
SHAGHAYEGH HOSSEINI, SEYEDEH
Keywords
Workflow management system
Pipeline development
Bioinformatics
High-Performance Computing
Language
eng