Middle-range clustering of nucleotides in genomes
Jan Mrazek and Jaroslav Kypr
Computer Applications in the Biosciences 11(2), 195--199 (1995)
Abstract
We propose a novel, transparent and very simple algorithm to
analyze middle-range correlations in genomic nucleotide sequences.
Applying this algorithm to the EMBL Nucleotide Sequence Database
shows that all four nucleotides cluster in the genomic
nucleotide sequences of eukaryotes on the scale of several
hundred base pairs. In prokaryotes, the clustering is weak but
still evident. Within the clusters of a given nucleotide, the other
three bases are deficient: A is the most deficient nucleotide in the
clusters of C, and vice versa, and G is the most deficient nucleotide
in the clusters of T, and vice versa. The algorithm also detects CG
islands, extending over 1 kb, in vertebrate sequences. In plants, the
CG islands are shown to be much smaller, if they exist at all. A
clustering tendency is also exhibited by the TA doublet. Other
doublets do not cluster. We observe no strong correlation between
nucleotides separated in genomes by more than 1 kb.
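
The abstract does not spell out the algorithm itself, so the following is
only a generic illustration of how clustering at a given separation might
be measured, not the authors' method. It assumes a simple statistic: how
often nucleotide y occurs d positions downstream of nucleotide x, divided
by the overall frequency of y. The function name distance_enrichment and
its parameters are hypothetical.

```python
def distance_enrichment(seq, x, y, d):
    """Observed/expected ratio for finding y exactly d positions after x.

    Illustrative statistic only (an assumption, not the published method).
    A value above 1 suggests y is enriched at distance d from x;
    a value below 1 suggests it is deficient there.
    """
    seq = seq.upper()
    n = len(seq)
    if d <= 0 or d >= n:
        raise ValueError("d must be between 1 and len(seq) - 1")

    pairs = n - d  # number of positions i for which i + d is still in range
    x_count = sum(1 for i in range(pairs) if seq[i] == x)
    xy_count = sum(
        1 for i in range(pairs) if seq[i] == x and seq[i + d] == y
    )

    y_freq = seq.count(y) / n  # background frequency of y in the sequence
    if x_count == 0 or y_freq == 0:
        return float("nan")  # ratio undefined when x or y is absent

    return (xy_count / x_count) / y_freq


if __name__ == "__main__":
    # Toy sequence with one block of A followed by one block of C.
    toy = "A" * 50 + "C" * 50
    print(distance_enrichment(toy, "A", "A", 5))  # 1.8
    print(distance_enrichment(toy, "A", "C", 5))  # 0.2
```

On the toy sequence, A is enriched 5 positions after A and C is deficient
there, which is the kind of pattern the abstract describes for clustered
nucleotides. Scanning d over a range of separations would show at what
distance the ratio returns to about 1.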