Correlation Approach to Identify Coding Regions in DNA Sequences
S.M. Ossadnik, S.V. Buldyrev, A.L. Goldberger, S. Havlin, R.N. Mantegna,
C.K. Peng, M. Simons, H.E. Stanley
Biophysical Journal, 67, 64--70 (July 1994)
Abstract
Recently, it was observed that noncoding regions of DNA sequences
possess long-range power-law correlations, whereas coding regions
typically display only short-range correlations. We develop an
algorithm based on this finding that enables investigators to
perform a statistical analysis on long DNA sequences to locate possible
coding regions. The algorithm is particularly successful in
predicting the location of lengthy coding regions. For example, for
the complete sequence of yeast chromosome III (315,344 nucleotides),
at least 82% of the predictions correspond to putative coding
regions; the algorithm correctly identified all coding regions
larger than 3000 nucleotides, 92% of coding regions between 2000
and 3000 nucleotides long, and 79% of coding regions between 1000
and 2000 nucleotides. The predictive ability of this new algorithm
supports the claim that there is a fundamental difference in the
correlation property between coding and noncoding sequences. This
algorithm, which is not species-dependent, can be implemented with
other techniques for rapidly and accurately locating relatively
long coding regions in genomic sequences.
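
The abstract does not spell out the algorithm itself. As a rough illustration of the underlying idea only, the Python sketch below assumes a purine-pyrimidine "DNA walk" representation of the sequence and estimates a fluctuation scaling exponent in sliding windows. The function names, the set of lags, and the window and step sizes are hypothetical choices made for this example, not the authors' implementation. Under these assumptions, an exponent near 0.5 would suggest only short-range correlations (a candidate coding region), while larger values would suggest long-range correlations.

```python
import numpy as np

def dna_walk(seq):
    """Cumulative walk: +1 for a pyrimidine (C, T), -1 for anything else.

    Assumption: ambiguous symbols such as N are treated like purines.
    """
    steps = np.array([1 if base in "CT" else -1 for base in seq.upper()])
    return np.concatenate(([0], np.cumsum(steps)))

def fluctuation(walk, lag):
    """Root-mean-square fluctuation F(lag) of the walk displacement."""
    delta = walk[lag:] - walk[:-lag]
    return np.sqrt(np.mean(delta ** 2) - np.mean(delta) ** 2)

def scaling_exponent(seq, lags=(4, 8, 16, 32, 64)):
    """Slope of log F(lag) versus log lag; about 0.5 for uncorrelated steps."""
    walk = dna_walk(seq)
    f = np.array([fluctuation(walk, lag) for lag in lags])
    slope, _intercept = np.polyfit(np.log(lags), np.log(f), 1)
    return slope

def local_exponents(seq, window=800, step=100):
    """(start, exponent) for each sliding window; sizes are illustrative."""
    return [(start, scaling_exponent(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]
```

Windows whose exponent stays close to 0.5 could then be flagged for closer inspection; any threshold used for that decision would have to be calibrated on annotated sequences.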