Promotech: a universal tool for promoter detection in bacterial genomes
Abstract
A promoter is a genomic sequence where the transcription machinery binds to begin transcribing a gene into an RNA molecule. Locating bacterial promoter sequences is essential in microbiology because promoters play a central role in regulating gene expression. Several tools exist for recognizing promoters in bacterial genomes; however, most were trained on data from a single bacterium or a specific set of sigma factors. Promotech was developed to overcome this limitation: it is a machine-learning-based classifier trained to generalize and detect promoters across a wide range of bacterial species. Two model architectures were tested during the study, Random Forest and recurrent neural networks. The Random Forest model, trained on promoter sequences in which each nucleotide is represented by a binary encoding, achieved the highest performance across nine different bacteria and could process both short 40 bp sequences and entire bacterial genomes using a sliding window. The selected model was evaluated on a validation set of four bacteria not used during training. On a balanced data set with 50% positive (promoter) and 50% negative (non-promoter) sequences, it achieved an average AUPRC of 0.73±0.13 and an AUROC of 0.71±0.13. Across the entire genomes of the validation set, it achieved an average AUPRC of 0.14±0.1 and an AUROC of 0.71±0.17, which increased to 0.75±0.18 AUPRC and 0.90±0.06 AUROC when the model was configured to detect promoter clusters. Promotech was compared against state-of-the-art bacterial promoter detection programs on the balanced data set and outperformed them.
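The binary encoding and sliding-window scan mentioned in the abstract can be illustrated with a minimal sketch. It is not Promotech's actual code: the one-hot mapping, the handling of unknown bases, the step size, and the names `ONE_HOT`, `encode`, and `sliding_windows` are all assumptions made for illustration. Only the 40-nucleotide window length comes from the abstract.

```python
# Hypothetical one-hot ("binary") code per nucleotide.
# The encoding actually used by Promotech may differ.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}

WINDOW = 40  # window length in nucleotides, matching the 40 bp sequences in the abstract


def encode(seq):
    """Flatten a DNA string into a list of 0/1 features (4 per nucleotide)."""
    features = []
    for base in seq.upper():
        # Unknown symbols such as N become all zeros (an assumption of this sketch).
        features.extend(ONE_HOT.get(base, (0, 0, 0, 0)))
    return features


def sliding_windows(genome, size=WINDOW, step=1):
    """Yield (start, subsequence) for every full-length window in the genome."""
    for start in range(0, len(genome) - size + 1, step):
        yield start, genome[start:start + size]


genome = "ACGT" * 12  # toy 48-nucleotide sequence, not real data
encoded = [(start, encode(window)) for start, window in sliding_windows(genome)]
```

Each entry in `encoded` pairs a window's start position with a fixed-length feature vector, which is the kind of input a classifier such as a Random Forest could score window by window across a genome.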
