Finding Genes in DNA with a Hidden Markov Model
J. Henderson, S. Salzberg, and K. H. Fasman
Journal of Computational Biology, 4(2):127-142 (Spring 1997)
Abstract
This study describes a new Hidden Markov Model (HMM)
system for segmenting uncharacterized genomic DNA
sequences into exons, introns, and intergenic regions.
Separate HMM modules were designed and trained for
specific regions of DNA: exons, introns, intergenic
regions, and splice sites. The models were then tied together to
form a biologically feasible topology. The integrated HMM
was trained further on a set of eukaryotic DNA sequences
and tested by using it to segment a separate set of
sequences. The resulting HMM system, which is called VEIL
(Viterbi Exon-Intron Locator), obtains an overall accuracy
on test data of 92% of total bases correctly labelled,
with a correlation coefficient of 0.73. Using the more
stringent test of exact exon prediction, VEIL correctly
located both ends of 53% of the coding exons, and 49% of
the exons it predicts are exactly correct. These results
compare favorably to the best previous results for gene
structure prediction and demonstrate the benefits of using
HMMs for this problem.
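
As an illustration of the segmentation step named in the
system's title, the following is a minimal sketch of Viterbi
decoding over a toy three-state HMM. It is not the VEIL
model: the state set, the start, transition, and emission
probabilities, and the function name viterbi are all
assumptions made up for this example, and the input is
assumed to be an uppercase string over A, C, G, T. The
sketch only shows how a most probable state path assigns one
label to each base.

```python
import math

# Hypothetical toy model: three labels and made-up probabilities.
# None of these numbers come from VEIL; they only illustrate decoding.
STATES = ["intergenic", "exon", "intron"]

START = {"intergenic": 0.8, "exon": 0.1, "intron": 0.1}

# A zero entry forbids that transition (e.g. intergenic -> intron).
TRANS = {
    "intergenic": {"intergenic": 0.90, "exon": 0.10, "intron": 0.00},
    "exon":       {"intergenic": 0.05, "exon": 0.85, "intron": 0.10},
    "intron":     {"intergenic": 0.00, "exon": 0.10, "intron": 0.90},
}

EMIT = {
    "intergenic": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "exon":       {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
    "intron":     {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
}


def log(p):
    """Natural log that maps probability 0 to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return the most probable state label for each base of seq."""
    if not seq:
        return []
    # score[s]: best log-probability of any path that ends in state s.
    score = {s: log(START[s]) + log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            # Best predecessor of s at this position.
            prev = max(STATES, key=lambda p: score[p] + log(TRANS[p][s]))
            new_score[s] = score[prev] + log(TRANS[prev][s]) + log(EMIT[s][base])
            pointer[s] = prev
        score = new_score
        backpointers.append(pointer)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    # Print each base next to its decoded label.
    dna = "ATATGCGCGCGCATATATAT"
    for base, label in zip(dna, viterbi(dna)):
        print(base, label)
```

Working in log space avoids numerical underflow on long
sequences, and the zero entries in the transition table show
how a topology can rule out biologically implausible label
sequences.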