Development of a machine learning algorithm classification tool to improve strain detection in whole genome metagenomics dataset
Author
Other authors
Publication date
2020-05-15
Abstract
Metagenomics is a pioneering branch of bioinformatics that uses genomic techniques, such as DNA sequencing, to obtain important information about microorganisms. In recent years, scientists have focused strongly on this innovative field, highlighting its importance in both clinical and environmental settings. In this respect, the lack of user-friendly software for metagenome analysis has become an important issue. GAIA is a bioinformatics tool, developed by Sequentia Biotech, that performs functional and taxonomic analyses of metagenomics data from both amplicon and whole genome sequencing. Like other software, GAIA can analyze data at the strain level. However, one limitation of GAIA is the high number of false positives that can arise in this type of analysis, owing to the high similarity between the genomes of different strains of the same species. We therefore worked on GAIA’s ability to taxonomically classify bacterial strains from their sequences. We benchmarked different machine learning classification models. We also had to handle imbalanced data, a common machine learning problem, by testing several methods and comparing them with one another. Finally, we identified the best model through hyperparameter tuning. The results show a significant improvement in the accuracy of GAIA’s predictions.
Document Type
Master's final project
Document version
Supervisor: Serrat Jurado, Josep Maria
Language
English
Keywords
Genomics
Genetic algorithms
Bioinformatics
Pages
32 p.
Note
Academic year 2019-2020
Rights
All rights reserved