Development of a machine learning algorithm classification tool to improve strain detection in whole genome metagenomics dataset
Author
Other authors
Publication date
2020-05-15
Abstract
Metagenomics is a pioneering branch of bioinformatics that uses genomic techniques, such as DNA sequencing, to obtain information about microorganisms. In recent years, scientists have focused strongly on this field, highlighting its importance in both clinical and environmental settings. In this context, the lack of user-friendly software for metagenome analysis has become an important issue. GAIA is a bioinformatics tool, developed by Sequentia Biotech, that performs functional and taxonomic analyses of metagenomics data from both amplicon and whole genome sequencing. Like other software, GAIA can analyze data at the strain level. However, one limitation of GAIA is the high number of false positives that can arise in this type of analysis, owing to the high similarity between the genomes of different strains of the same species. To address this, we worked on improving GAIA’s ability to taxonomically classify bacterial strains from their sequences. We benchmarked several machine learning classification models. We also had to handle class imbalance, a common machine learning problem, by testing different methods and comparing them with one another. Finally, we selected the best model through hyperparameter tuning. The results show a significant improvement in the accuracy of GAIA’s predictions.
Document type
Master's thesis
Document version
Supervisor: Serrat Jurado, Josep Maria
Language
English
Keywords
Genomics
Genetic algorithms
Bioinformatics
Pages
32 p.
Note
Academic year 2019-2020
Rights
All rights reserved