Taxonomic classification of metagenomic reads using machine learning models

Description

  • Abstract

    Microorganisms such as bacteria can be hard to identify correctly. Most current classification techniques rely on well-conserved genes, such as the 16S ribosomal RNA (16S rRNA) gene. Nevertheless, building a classifier that identifies bacteria from 16S rRNA data with high accuracy remains a challenge. For this reason, machine learning approaches based on a k-mer representation can still contribute to solving this problem. Mapping DNA sequences to vectors in a numerical space, by counting the frequency of each k-mer in a given sequence, is essential for training machine learning algorithms. Two deep learning models, Convolutional Neural Networks and Deep Belief Networks, and a tree-based model, XGBoost, are trained on synthetic datasets of 16S rRNA sequences. These synthetic datasets are the 16S rRNA shotgun (SG) dataset and the amplicon (AMP) dataset, which covers only specific 16S hypervariable regions. Comparing the performance of the models on the synthetic datasets provides useful information. It is also relevant to explore how these models perform on real data available in public genomic databases (NCBI, SILVA and FDA-ARGOS). Analysing the classifiers’ performance on real data helps to estimate the reliability of both the classifiers and the public genomic databases.
  • Description

    Master's thesis for: Master in Intelligent Interactive Systems
    Supervisors: Mario Ceresa, Antonio Puertas, Vicenç Gómez
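
The k-mer representation mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `kmer_frequencies`, the default `k=3`, the normalization to relative frequencies, and the choice to skip windows containing characters outside `ACGT` are all assumptions made for illustration.

```python
from itertools import product


def kmer_frequencies(seq, k=3, alphabet="ACGT"):
    """Map a DNA sequence to a vector of relative k-mer frequencies.

    Hypothetical helper: the vector has len(alphabet) ** k entries,
    ordered lexicographically by k-mer (AAA, AAC, AAG, ...).
    """
    # Enumerate every possible k-mer and assign it a fixed vector position.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}

    counts = [0] * len(kmers)
    seq = seq.upper()

    # Slide a window of width k over the sequence and count each k-mer.
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # assumption: ignore windows with ambiguous bases
            counts[j] += 1

    # Normalize counts so that sequences of different length are comparable.
    total = sum(counts)
    if total == 0:
        return [0.0] * len(kmers)
    return [c / total for c in counts]
```

With `k=2`, the sequence `ACGT` contains the windows `AC`, `CG` and `GT`, so the resulting 16-dimensional vector has three non-zero entries of 1/3 each. Vectors of this kind have a fixed length regardless of sequence length, which is what allows them to be fed to models such as those named in the abstract.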