PromFD 1.0: A Computer Program That Predicts Eukaryotic Pol
II Promoters Using Strings and IMD Matrices
Q.K. Chen, G.Z. Hertz, G.D. Stormo
Computer Applications in the Biosciences
13(1), 29-35 (Feb 1997)
Abstract
MOTIVATION: A large number of new DNA sequences with
virtually unknown functions are generated as the Human Genome
Project progresses. Therefore, it is essential to develop computer
algorithms that can predict the functionality of DNA segments
according to their primary sequences, including algorithms that can
predict promoters. Although several promoter-predicting algorithms
are available, they have high false-positive rates, and their rate of
promoter detection needs further improvement. RESULTS: In this
research, PromFD, a computer program to recognize vertebrate RNA
polymerase II promoters, has been developed. Both vertebrate
promoters and non-promoter sequences are used in the analysis. The
promoters are obtained from the Eukaryotic Promoter Database.
Promoters are divided into a training set and a test set. Non-promoter
sequences are obtained from the GenBank sequence databank, and are
also divided into a training set and a test set. The first step is to search,
among all possible permutations, for string patterns 5-10 bp long
that are significantly over-represented in the promoter set. The
program also searches for IMD (Information Matrix Database) matrices
that have a significantly higher presence in the promoter set. The
results of the searches are stored in the PromFD database, and the
program PromFD scores input DNA sequences according to their
content of the database entries. PromFD predicts promoters: their
locations and the locations of potential TATA boxes, if found. The
program can detect 71% of promoters in the training set with a
false-positive rate of under 1 in every 13,000 bp, and 47% of promoters
in the test set with a false-positive rate of under 1 in every 9800 bp.
PromFD uses a new approach, and its false-positive identification rate
is lower than that of other available promoter recognition
algorithms. The source code for PromFD is written in C++.
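
The string-based part of the approach described above (finding 5-10 bp strings over-represented in the promoter training set, then scoring an input sequence by the stored strings it contains) can be illustrated with a minimal sketch. This is not the PromFD implementation: the abstract does not give the significance test or the scoring formula, so the sketch substitutes a simple pseudocount-smoothed frequency ratio with a cutoff and uses the log2 ratio as the weight. The function names, the parameters `min_ratio`, `min_hits`, and `pseudo`, and the toy sequences are all hypothetical, and the IMD matrix search is left out. Python is used for brevity, although the original program is written in C++.

```python
from collections import Counter
from math import log2


def count_kmers(seqs, k):
    """Count how many sequences contain each k-mer at least once."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        # A set ensures each k-mer is counted once per sequence.
        counts.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return counts


def overrepresented(promoters, background, k_range=range(5, 11),
                    min_ratio=2.0, min_hits=2, pseudo=1.0):
    """Return {kmer: log2 enrichment} for strings enriched in promoters.

    Assumption: a smoothed frequency ratio with a fixed cutoff stands in
    for the (unspecified) significance test of the original program.
    """
    weights = {}
    for k in k_range:
        pos = count_kmers(promoters, k)
        neg = count_kmers(background, k)
        for kmer, n_pos in pos.items():
            if n_pos < min_hits:
                continue  # ignore strings seen in too few promoters
            f_pos = (n_pos + pseudo) / (len(promoters) + 2 * pseudo)
            f_neg = (neg.get(kmer, 0) + pseudo) / (len(background) + 2 * pseudo)
            ratio = f_pos / f_neg
            if ratio >= min_ratio:
                weights[kmer] = log2(ratio)
    return weights


def score_sequence(seq, weights):
    """Sum the weights of all stored strings found in seq (overlaps allowed)."""
    seq = seq.upper()
    lengths = {len(kmer) for kmer in weights}
    return sum(weights.get(seq[i:i + k], 0.0)
               for k in lengths
               for i in range(len(seq) - k + 1))


# Toy training data (hypothetical, far smaller than any real training set).
toy_promoters = ["GGTATAAAGG", "CCTATAAACC", "AATATAAATT"]
toy_background = ["GGGCGCGCCC", "ACGTACGTAC", "CCGGCCGGAA"]

toy_weights = overrepresented(toy_promoters, toy_background)
print(sorted(toy_weights))
print(score_sequence("CCCTATAAAGGG", toy_weights))
print(score_sequence("GCGCGCGCGC", toy_weights))
```

In a sketch like this, a sequence would be called a promoter when its score exceeds a threshold. Raising that threshold lowers the false-positive rate at the cost of detecting fewer true promoters, which is the trade-off the reported detection and false-positive figures summarize.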