Compositional Segmentation and Long-range Fractal Correlations in DNA Sequences

Pedro Bernaola-Galvan, Ramon Roman-Roldan, Jose L. Oliver

Physical Review E, to appear (May 1996)

Abstract

A segmentation algorithm based on the Jensen-Shannon entropic divergence is used to decompose long-range correlated DNA sequences into statistically significant, compositionally homogeneous patches. By appropriately setting the significance level for segmenting the sequence, the underlying power-law distribution of patch lengths can be revealed. Some of the identified DNA domains were uncorrelated, but most of them continued to display long-range correlations even after several steps of recursive segmentation, indicating a complex multi-length-scaled structure for the sequence. In addition, by separately shuffling each segment, or by randomly rearranging the order in which the different segments occur in the sequence, shuffled sequences preserving the original statistical distribution of patch lengths were generated. Both types of random sequences displayed the same correlation scaling exponents as the original DNA sequences, demonstrating that neither the internal structure of patches nor the order in which they are arranged in the sequence is critical; long-range correlations in nucleotide sequences therefore seem to rely only on the power-law distribution of patch lengths.
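The idea of recursive segmentation by Jensen-Shannon divergence can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, and the plain `threshold` on the divergence is an assumed stand-in for the statistical significance level mentioned in the abstract. The last helper illustrates the first of the two shuffling controls (shuffling each segment separately while keeping its boundaries).

```python
import math
import random
from collections import Counter


def shannon_entropy(counts, total):
    """Shannon entropy in bits of a composition given by symbol counts."""
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)


def js_divergence(seq, cut):
    """Length-weighted Jensen-Shannon divergence between seq[:cut] and seq[cut:]."""
    n = len(seq)
    left, right = seq[:cut], seq[cut:]
    h_all = shannon_entropy(Counter(seq), n)
    h_left = shannon_entropy(Counter(left), len(left))
    h_right = shannon_entropy(Counter(right), len(right))
    return h_all - (len(left) / n) * h_left - (len(right) / n) * h_right


def best_cut(seq):
    """Return (cut, divergence) for the interior cut with maximal divergence.

    Brute force (quadratic in len(seq)); adequate only for short examples.
    """
    return max(((i, js_divergence(seq, i)) for i in range(1, len(seq))),
               key=lambda pair: pair[1])


def segment(seq, threshold, start=0):
    """Recursively split seq; return the sorted list of cut positions.

    ASSUMPTION: a segment is split whenever the maximal divergence reaches
    `threshold`. The paper uses a significance level instead.
    """
    if len(seq) < 2:
        return []
    cut, divergence = best_cut(seq)
    if divergence < threshold:
        return []
    return (segment(seq[:cut], threshold, start)
            + [start + cut]
            + segment(seq[cut:], threshold, start + cut))


def shuffle_within_segments(seq, cuts, rng):
    """Shuffle symbols inside each segment, leaving segment boundaries fixed."""
    bounds = [0] + list(cuts) + [len(seq)]
    pieces = []
    for a, b in zip(bounds, bounds[1:]):
        chars = list(seq[a:b])
        rng.shuffle(chars)
        pieces.append("".join(chars))
    return "".join(pieces)


if __name__ == "__main__":
    toy = "A" * 20 + "G" * 20          # two perfectly homogeneous patches
    cuts = segment(toy, threshold=0.5)
    print(cuts)
    print(shuffle_within_segments(toy, cuts, random.Random(0)))
```

On the toy sequence the two patches have disjoint compositions, so the divergence at the patch boundary reaches its maximum of 1 bit and neither half is split further. On real DNA the choice of threshold (or significance level) determines how finely the sequence is partitioned.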