Promotech: a universal tool for promoter detection in bacterial genomes

Keywords

Promoter, Machine Learning, Bacterial Genome, Promoter Detection, Bioinformatics

Degree Level

masters

Degree Name

M. Sc.

Publisher

Memorial University of Newfoundland

Abstract

A promoter is a genomic sequence where the transcription machinery binds to begin transcribing a gene into an RNA molecule. Locating bacterial promoter sequences is essential in microbiology because promoters play a central role in regulating gene expression. Several tools recognize promoters in bacterial genomes; however, most were trained on data from a single bacterium or a specific set of sigma factors. Promotech was developed to overcome this limitation: it is a machine-learning-based classifier whose model is trained to generalize and detect promoters across a wide range of bacterial species. Two model architectures were tested during the study, Random Forest and recurrent neural networks. The Random Forest model, trained on promoter sequences in which each nucleotide is represented by a binary encoding, achieved the highest performance across nine different bacteria and can process both short 40 bp sequences and entire bacterial genomes, the latter by means of a sliding window. The selected model was evaluated on a validation set of four bacteria not used during training. On a balanced data set (50% positive, promoter sequences and 50% negative, non-promoter sequences), it achieved an average AUPRC of 0.73±0.13 and an average AUROC of 0.71±0.13. Across the entire genomes of the validation set, the Random Forest model achieved an average AUPRC of 0.14±0.1 and an average AUROC of 0.71±0.17; these values increased to 0.75±0.18 AUPRC and 0.90±0.06 AUROC when the model was configured to detect promoter clusters. On the balanced data set, Promotech outperformed the state-of-the-art bacterial promoter detection programs against which it was compared.
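
The binary nucleotide encoding and the sliding window mentioned in the abstract can be illustrated with a minimal sketch. This is not Promotech's actual code: the 4-bit one-hot scheme, the all-zero vector for ambiguous bases, the step size of 1, and the names `encode` and `sliding_windows` are assumptions made for illustration. Only the 40 bp window length comes from the abstract.

```python
# Illustrative sketch only; not taken from the Promotech source code.
# Assumption: each base is one-hot encoded as 4 bits; any other symbol
# (for example N) is encoded as all zeros.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}

WINDOW = 40  # window length in bp, as stated in the abstract


def encode(seq):
    """Flatten a DNA string into a binary feature vector (4 bits per base)."""
    features = []
    for base in seq.upper():
        features.extend(ONE_HOT.get(base, (0, 0, 0, 0)))
    return features


def sliding_windows(genome, size=WINDOW, step=1):
    """Yield (start, subsequence) for every full-length window in the genome."""
    for start in range(0, len(genome) - size + 1, step):
        yield start, genome[start:start + size]


genome = "ACGT" * 12  # toy 48 bp sequence standing in for a genome
rows = [encode(window) for _, window in sliding_windows(genome)]
print(len(rows), len(rows[0]))
```

Under these assumptions, each 40 bp window becomes a vector of 160 binary features, and the toy 48 bp sequence yields 9 windows. Feature vectors of this kind are what a classifier such as a Random Forest would score, one window at a time, when scanning a whole genome.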
