Microorganisms such as bacteria can be hard to identify correctly. Most current
classification techniques are based on well-conserved genes, for instance the 16S ribosomal
RNA (16S rRNA) gene. Nevertheless, building a classifier that identifies bacteria from
16S rRNA data with high accuracy remains a challenge. For this reason,
machine learning approaches built on a k-mer representation
can still contribute to solving this problem. Mapping the DNA sequences to vectors
in a numerical space, by counting the frequency of each k-mer in a given sequence,
is essential for training the machine learning algorithms. Two deep learning
models, Convolutional Neural Networks and Deep Belief Networks, as well as
a tree-based model, XGBoost, are trained on synthetic datasets of 16S rRNA
sequences: the 16S rRNA shotgun (SG) dataset and the amplicon (AMP) dataset,
which considers only specific 16S hypervariable regions. Comparing
the performance of these models on the two synthetic datasets shows how each handles shotgun versus amplicon input.
Moreover, it is relevant to explore how these models perform on real data
available in public genomic databases (NCBI, SILVA and FDA-ARGOS). Analysing
the classifiers’ performance on real data helps to estimate the
reliability of both the classifiers and the public genomic databases.
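
As an illustration of the k-mer representation mentioned above, the following is a minimal sketch of how a DNA sequence could be mapped to a vector of k-mer counts. It is not the implementation used in this work: the function name `kmer_frequency_vector`, the default `k=3`, the lexicographic ordering of k-mers, and the choice to skip windows containing ambiguous bases (such as `N`) are all assumptions made for the example.

```python
from itertools import product


def kmer_frequency_vector(sequence, k=3, alphabet="ACGT"):
    """Count every k-mer over `alphabet` in `sequence`.

    Hypothetical helper for illustration only. The returned list has
    len(alphabet) ** k entries, ordered lexicographically by k-mer.
    """
    # Enumerate all possible k-mers and remember each one's vector position.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}

    counts = [0] * len(kmers)
    seq = sequence.upper()

    # Slide a window of length k over the sequence, one base at a time.
    for start in range(len(seq) - k + 1):
        position = index.get(seq[start:start + k])
        # Assumption: windows with symbols outside the alphabet are ignored.
        if position is not None:
            counts[position] += 1
    return counts


if __name__ == "__main__":
    vector = kmer_frequency_vector("ACGTAC", k=2)
    print(len(vector), sum(vector))
```

Every sequence maps to a vector of the same length, 4^k for the four-letter DNA alphabet, regardless of the sequence length, which is what allows sequences of different sizes to be fed to the same model. Dividing each count by the total number of windows would give relative frequencies instead of raw counts.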