Compositional Segmentation and Long-range Fractal Correlations in
DNA Sequences
Pedro Bernaola-Galvan, Ramon Roman-Roldan, Jose L. Oliver
Physical Review E, to appear (May 1996)
Abstract
A segmentation algorithm based on the Jensen-Shannon entropic divergence is
used to decompose long-range correlated DNA sequences into
statistically significant, compositionally homogeneous patches.
By appropriately setting the significance level for segmenting the
sequence, the underlying power-law distribution of patch lengths can
be revealed. Some of the identified DNA domains were uncorrelated,
but most of them continued displaying long-range correlations even
after several steps of recursive segmentation, thus indicating
a complex multi-length-scaled structure for the sequence. On
the other hand, by separately shuffling each segment, or by
randomly rearranging the order in which the different segments
occur in the sequence, shuffled sequences preserving the original
statistical distribution of patch lengths were generated. Both types
of random sequences displayed the same correlation scaling exponents
as the original DNA sequences, thus demonstrating that neither
the internal structure of patches nor the order in which these
are arranged in the sequence is critical; therefore, long-range
correlations in nucleotide sequences seem to rely only on the
power-law distribution of patch lengths.
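
To make the procedure summarized above more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes the length-weighted form of the Jensen-Shannon divergence between the two sides of a candidate cut, and it replaces the significance test mentioned in the abstract with a plain divergence threshold, which is a simplification. All function names (shannon_entropy, js_divergence, best_split, segment, shuffle_within, shuffle_order) are hypothetical, and the brute-force search is quadratic in sequence length, so it is only suitable for short toy sequences.

```python
import math
import random
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the symbol composition of seq."""
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def js_divergence(left, right):
    """Length-weighted Jensen-Shannon divergence between two subsequences."""
    n = len(left) + len(right)
    return (shannon_entropy(left + right)
            - len(left) / n * shannon_entropy(left)
            - len(right) / n * shannon_entropy(right))


def best_split(seq):
    """Return (position, divergence) of the cut that maximizes the divergence.

    Returns (None, 0.0) when no cut gives a strictly positive divergence.
    """
    best_pos, best_d = None, 0.0
    for i in range(1, len(seq)):
        d = js_divergence(seq[:i], seq[i:])
        if d > best_d:
            best_pos, best_d = i, d
    return best_pos, best_d


def segment(seq, threshold):
    """Recursively cut seq while the best divergence reaches the threshold.

    Assumption: a raw threshold stands in for a proper significance level.
    """
    pos, d = best_split(seq)
    if pos is None or d < threshold:
        return [seq]
    return segment(seq[:pos], threshold) + segment(seq[pos:], threshold)


def shuffle_within(segments, rng):
    """Shuffle the symbols inside each segment; segment order is kept."""
    shuffled = []
    for s in segments:
        symbols = list(s)
        rng.shuffle(symbols)
        shuffled.append("".join(symbols))
    return shuffled


def shuffle_order(segments, rng):
    """Rearrange the order of the segments; their contents are kept."""
    reordered = list(segments)
    rng.shuffle(reordered)
    return reordered


if __name__ == "__main__":
    rng = random.Random(0)
    toy = "A" * 20 + "G" * 20 + "ACGT" * 5  # hypothetical toy sequence
    patches = segment(toy, threshold=0.3)
    print([len(p) for p in patches])
    print("".join(shuffle_within(patches, rng)))
    print("".join(shuffle_order(patches, rng)))
```

Both surrogate functions leave the list of patch lengths unchanged, which is the property the two kinds of shuffled sequences described above are meant to preserve.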