Abstract of the Dissertation
Metagenomics is revolutionizing microbial ecology and has unlocked unprecedented opportunities in many domains of the life sciences. For instance, metagenomics has allowed the discovery of new forms of life in unexplored habitats in the marine environment. In medicine, metagenomics can enable more accurate and faster diagnosis than standard laboratory procedures. In the context of pathogen surveillance in public health, or biosurveillance, it has been successfully applied with limited resources to monitor outbreaks in epidemic areas.
As sequencing technologies have considerably improved in speed and cost over the past decade, the number of reference sequences in public databases is growing exponentially, and thus faster yet still accurate and efficient computational methods are needed to analyze these large data sets. The research presented in this dissertation focuses on (i) how to build faster, more accurate and more efficient sequence classification methods to determine the microbial composition of metagenomic samples (i.e., the CLARK series), and (ii) how to infer or recover the microbial composition from missing or contaminated data, for example in the context of city-scale biosurveillance.
Our classification system is composed of several tools, namely CLARK, CLARK-l and CLARK-S, which are already used by several research teams worldwide as state-of-the-art methods. CLARK performs sequence classification with high accuracy and unprecedented speed, while CLARK-S, a variant of CLARK, achieves higher accuracy than CLARK and remains fast. We also show that the new sequence analysis methods are versatile and applicable to several contexts of sequence classification, for example, BAC clones in the context of the barley genome.
Keywords: Microbiome, metagenomics, genomics, comparative microbiomics, classification, prediction, inference, sequence analysis, light-weight algorithm, k-mers, discriminative spaced k-mers, target-specific k-mers.