Identification of Coding Regions in Genomic DNA
E.E. Snyder and G.D. Stormo
Journal of Molecular Biology, 248:1-18 (1995)
Abstract
We have developed a computer program, GeneParser, which identifies and determines
the fine structure of protein genes in genomic DNA sequences. The program scores
all subintervals in a sequence for content statistics indicative of introns and
exons and for sites which identify their boundaries. This information is
weighted by a neural network to approximate the log-likelihood that each subinterval
exactly represents an intron or exon (first, internal or last). A dynamic
programming (DP) algorithm is then applied to these data to find the combination
of introns and exons which maximizes the likelihood function. Using this method,
we can rapidly generate ranked suboptimal solutions, each of which is the
optimum solution containing a given intron-exon junction. We have tested the
system on a large collection of human genes. On sequences not used in training,
we achieved a correlation coefficient for exon nucleotide prediction of 0.89.
For a subset of G+C rich genes, a correlation coefficient of 0.94 was achieved.
We have also quantified the robustness of the method to substitution and
frame-shift errors and show how the system can be optimized for performance
on sequences with known levels of sequencing errors.
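
The dynamic programming step described above can be illustrated with a minimal sketch. It is not the GeneParser implementation: it assumes a simplified model in which the whole sequence is partitioned into alternating exon and intron intervals, it does not distinguish first, internal and last exons, and it does not produce ranked suboptimal solutions. The names best_parse, toy_score and TOY_TABLE are hypothetical, and the toy scores are hand-picked stand-ins for the neural-network output.

```python
from math import inf

EXON, INTRON = "exon", "intron"
OTHER = {EXON: INTRON, INTRON: EXON}


def best_parse(n, score):
    """Return (total, segments) for the best alternating exon/intron parse.

    n      -- sequence length; intervals are half-open [i, j) with 0 <= i < j <= n
    score  -- hypothetical callable score(kind, i, j) giving an approximate
              log-likelihood that [i, j) is exactly an interval of that kind
    """
    # best[(j, kind)]: best total for a parse of [0, j) whose last interval is `kind`.
    # Both entries at position 0 are 0.0, so the parse may start with either kind.
    best = {(0, EXON): 0.0, (0, INTRON): 0.0}
    back = {}
    for j in range(1, n + 1):
        for kind in (EXON, INTRON):
            top, arg = -inf, None
            for i in range(j):
                # The interval ending at i must be of the opposite kind.
                s = best[(i, OTHER[kind])] + score(kind, i, j)
                if arg is None or s > top:
                    top, arg = s, i
            best[(j, kind)] = top
            back[(j, kind)] = arg

    # Trace back from the better of the two possible final interval kinds.
    kind = max((EXON, INTRON), key=lambda k: best[(n, k)])
    total = best[(n, kind)]
    segments = []
    j = n
    while j > 0:
        i = back[(j, kind)]
        segments.append((kind, i, j))
        j, kind = i, OTHER[kind]
    segments.reverse()
    return total, segments


# Toy scores (assumed values): three intervals score well, all others are penalized.
TOY_TABLE = {
    (EXON, 0, 3): 2.0,
    (INTRON, 3, 7): 3.0,
    (EXON, 7, 10): 1.5,
}


def toy_score(kind, i, j):
    return TOY_TABLE.get((kind, i, j), -5.0)


total, segments = best_parse(10, toy_score)
print(total, segments)
```

Because every subinterval is scored, this sketch takes time quadratic in the sequence length. On the toy input it recovers the three favoured intervals, since any other partition must include at least one penalized interval.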