Development of a machine learning algorithm classification tool to improve strain detection in whole genome metagenomics dataset
Author
Other authors
Publication date
2020-05-15
Abstract
Metagenomics is a pioneering branch of bioinformatics that uses genomic techniques, such as DNA sequencing, to obtain information about microorganisms. In recent years, scientists have focused strongly on this field, highlighting its importance in both clinical and environmental settings. In this context, the lack of user-friendly software for metagenome analysis has become an important issue. GAIA is a bioinformatics tool, developed by Sequentia Biotech, that performs functional and taxonomic analyses of metagenomics data from both amplicon and whole genome sequencing. Like other software, GAIA can analyze data at the strain level. However, one limitation of GAIA is the high number of false positives that can arise in this type of analysis, owing to the high similarity between the genomes of different strains of the same species. We therefore worked on GAIA’s ability to taxonomically classify bacterial strains from their sequences. We benchmarked different machine learning classification models. We also had to handle the imbalanced data problem, a common machine learning issue, by testing different methods and comparing them with each other. Finally, we selected the best model using hyperparameter tuning. The results show a significant improvement in the accuracy of GAIA’s predictions.
Document type
Master’s thesis
Document version
Supervisor: Serrat Jurado, Josep Maria
Language
English
Keywords
Genomics
Genetic algorithms
Bioinformatics
Pages
32 p.
Note
Academic year 2019-2020
This item appears in the following collection(s)
Rights
All rights reserved