Distinctive Sequence Features in Protein Coding, Genic Non-coding,
and Intergenic Human DNA
Roderic Guigo and James W. Fickett
Journal of Molecular Biology, 253:51-60 (1995)
Abstract
We have studied the behavior of a number of sequence statistics,
mostly indicative of protein coding function, in a large set of
human clone sequences randomly selected in the course of genome
mapping (randomly selected clone sequences), and compared this
with the behavior in known sequences containing genes (which we
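term genic sequences). As expected, given the higher coding
density of the genic sequences, the sequence statistics studied
behave in a substantially different manner in the randomly
selected clone sequences (mostly intergenic DNA) and in the genic
sequences. Strong differences in a number of these statistics
are also observed, however, when the randomly selected clone
sequences are compared with only the non-coding fraction of the
genic sequences, suggesting that intergenic and genic non-coding
DNA constitute two different classes of non-coding DNA. By
studying the behavior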
of the sequence statistics in simulated DNA of different C+G
content, we have observed that a number of them are strongly
dependent on C+G content. Thus, most differences between
intergenic and genic non-coding DNA can be explained by
differences in C+G content. A+T rich intergenic DNA appears to be
at the compositional equilibrium expected under random mutation,
while C+G richer non-coding genic DNA is far from this equilibrium.
The results obtained in simulated DNA indicate, on the other hand,
that a very large fraction of the variation in the coding
statistics that underlie gene identification algorithms is due
simply to C+G content, and is not directly related to protein
coding function. It appears, thus, that the performance of
gene-finding algorithms should be improved by carefully
distinguishing the effects of protein coding function from
those of mere base compositional variation on such coding statistics.
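
The dependence of a coding statistic on C+G content alone can be
illustrated with a minimal sketch. The statistic used here, the
fraction of in-frame stop codons, is chosen for illustration only
and is not claimed to be one of the statistics studied in the paper.
The simulation model (independent bases with equal A and T
frequencies and equal C and G frequencies) and all function names
are assumptions of this sketch.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}


def simulate_dna(length, gc_content, rng):
    """Draw independent bases so that C+G makes up gc_content of the sequence.

    Assumes P(A) = P(T) and P(C) = P(G); real DNA need not satisfy this.
    """
    at = (1.0 - gc_content) / 2.0
    gc = gc_content / 2.0
    # Weights are given in the same order as the alphabet "ACGT".
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))


def stop_codon_fraction(seq):
    """Fraction of codons in reading frame 0 that are stop codons."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    return sum(codon in STOP_CODONS for codon in codons) / len(codons)


def expected_stop_fraction(gc_content):
    """Analytic value under the same independent-base model."""
    at = (1.0 - gc_content) / 2.0
    gc = gc_content / 2.0
    # P(TAA) = at^3, and P(TAG) = P(TGA) = at^2 * gc.
    return at ** 3 + 2.0 * at ** 2 * gc


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for gc_content in (0.3, 0.4, 0.5, 0.6, 0.7):
        seq = simulate_dna(300_000, gc_content, rng)
        observed = stop_codon_fraction(seq)
        expected = expected_stop_fraction(gc_content)
        print(f"C+G = {gc_content:.1f}  observed = {observed:.4f}  "
              f"expected = {expected:.4f}")
```

Under this model the stop codon fraction falls steadily as C+G
content rises, although none of the simulated sequences encodes a
protein. A statistic of this kind therefore responds to base
composition as well as to coding function, which is the confounding
effect described above.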