Identification of Coding Regions in Genomic DNA Sequences:
An Application of Dynamic Programming and Neural Networks
E.E. Snyder, G.D. Stormo
Nucleic Acids Research, 21(3):607-613 (1993)
Abstract
Dynamic programming (DP) is applied to the problem of precisely identifying
internal exons and introns in genomic DNA sequences. The program GeneParser first
scores the sequence of interest for splice sites and for these intron- and
exon-specific content measures: codon usage, local compositional complexity, 6-tuple
frequency, length distribution and periodic asymmetry. This information is
then organized for interpretation by DP. GeneParser employs the DP algorithm
to enforce the constraints that introns and exons must be adjacent and
non-overlapping and finds the highest scoring combination of introns and
exons subject to these constraints. Weights for the various classification
procedures are determined by training a simple feed-forward neural network
to maximize the number of correct predictions. In a pilot study, the system
has been trained on a set of 56 human gene fragments containing 150 internal
exons in a total of 158,691 bp of genomic sequence. When tested against the
training data, GeneParser precisely identifies 75% of the exons and correctly
predicts 86% of coding nucleotides as coding, while only 13% of non-exon bp
are predicted to be coding. This corresponds to a correlation coefficient
for exon prediction of 0.85. Because of the simplicity of the network
weighting scheme, generalization performance is nearly as good as performance
on the training set.
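
The DP step described above, choosing the highest scoring combination of adjacent, non-overlapping introns and exons, can be illustrated with a minimal Python sketch. This is not the GeneParser implementation: the function name best_parse, the callables exon_score and intron_score, and the toy scores are all hypothetical, and the neural network weighting of the content measures is assumed to have already been folded into those segment scores.

```python
from typing import Callable, List, Tuple

Segment = Tuple[str, int, int]  # (label, start, end), end inclusive


def best_parse(
    n: int,
    exon_score: Callable[[int, int], float],    # hypothetical: score of i..j as an exon
    intron_score: Callable[[int, int], float],  # hypothetical: score of i..j as an intron
) -> Tuple[float, List[Segment]]:
    """Highest scoring parse of positions 0..n-1 into alternating segments."""
    if n == 0:
        return 0.0, []
    neg_inf = float("-inf")
    score = {"exon": exon_score, "intron": intron_score}
    other = {"exon": "intron", "intron": "exon"}
    # best[label][j]: best score for positions 0..j-1 when the last segment is `label`
    best = {lab: [neg_inf] * (n + 1) for lab in score}
    back = {lab: [0] * (n + 1) for lab in score}

    for j in range(1, n + 1):
        for lab in score:
            for i in range(j):  # candidate last segment covers i..j-1
                # The preceding segment must carry the other label, so segments
                # are adjacent, non-overlapping, and strictly alternating.
                prev = 0.0 if i == 0 else best[other[lab]][i]
                cand = prev + score[lab](i, j - 1)
                if cand > best[lab][j]:
                    best[lab][j] = cand
                    back[lab][j] = i

    # Trace back from the better of the two possible final labels.
    lab = max(score, key=lambda s: best[s][n])
    total = best[lab][n]
    segments: List[Segment] = []
    j = n
    while j > 0:
        i = back[lab][j]
        segments.append((lab, i, j - 1))
        j, lab = i, other[lab]
    segments.reverse()
    return total, segments


# Toy per-base signal (made up): positive values favor exon, negative favor intron.
signal = [1.0, 1.0, -1.0, -1.0, 1.0]


def toy_exon(i: int, j: int) -> float:
    return sum(signal[i:j + 1])


def toy_intron(i: int, j: int) -> float:
    return -sum(signal[i:j + 1])


total, segments = best_parse(len(signal), toy_exon, toy_intron)
print(total, segments)
```

The sketch evaluates every candidate segment, so it runs in time quadratic in the sequence length. A practical parser would presumably restrict segment boundaries to candidate splice sites, a refinement omitted here.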