Estimation of Protein Coding Density in a Corpus of DNA Sequence Data
J.W. Fickett, R. Guigo
Nucleic Acids Research, 21(12):2837-2844 (Jun 25, 1993)
Abstract
A number of experimental methods have been reported for estimating
the number of genes in a genome, or the closely related coding
density of a genome, defined as the fraction of base pairs in
codons. Recently, DNA sequence data representative of the genome as
a whole have become available for several organisms, making the
problem of estimating coding density amenable to sequence-analytic
methods. Estimates of coding density for a single genome vary
widely, so that methods with characterized error bounds have become
increasingly desirable. We present a method to estimate the protein
coding density in a corpus of DNA sequence data, in which a 'coding
statistic' is calculated for a large number of windows of the
sequence under study, and the distribution of the statistic is
decomposed into two normal distributions, assumed to be the
distributions of the coding statistic in the coding and noncoding
fractions of the sequence windows. The accuracy of the method is
evaluated using known data, and the method is applied to the yeast
chromosome III sequence and to C. elegans cosmid sequences. The
method can also be applied to fragmentary data, for example a
collection of short sequences determined in the course of STS mapping.
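
The decomposition step described above can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say which coding statistic is used or how the two normal distributions are fitted, so this sketch assumes that the per-window statistic values are already available, that a standard expectation-maximization (EM) fit stands in for the decomposition, and that the component with the higher mean corresponds to coding windows. The names windows, normal_pdf and fit_two_normals are hypothetical.

```python
import math
import random


def windows(seq, size, step):
    """Yield successive fixed-size windows of a sequence (hypothetical helper)."""
    for start in range(0, len(seq) - size + 1, step):
        yield seq[start:start + size]


def normal_pdf(x, mu, sd):
    """Density of a normal distribution with mean mu and standard deviation sd."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_normals(values, iters=200):
    """Fit a two-component normal mixture by EM (an assumed fitting method).

    Returns (weight_hi, (mu_lo, sd_lo), (mu_hi, sd_hi)). Under the assumption
    that coding windows score higher, weight_hi estimates the coding fraction
    of the windows.
    """
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    # Initialise the two means from the lower and upper halves of the data.
    mu = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    mean_all = sum(xs) / n
    sd_all = math.sqrt(sum((x - mean_all) ** 2 for x in xs) / n) or 1e-6
    sd = [sd_all, sd_all]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior probability that each value belongs to component 1.
        resp = []
        for x in xs:
            p0 = w[0] * normal_pdf(x, mu[0], sd[0])
            p1 = w[1] * normal_pdf(x, mu[1], sd[1])
            total = p0 + p1
            resp.append(p1 / total if total > 0 else 0.5)
        # M-step: re-estimate weights, means and standard deviations.
        n1 = sum(resp)
        n0 = n - n1
        if n0 < 1e-9 or n1 < 1e-9:
            break  # one component has collapsed; keep the last estimates
        mu[0] = sum((1 - r) * x for r, x in zip(resp, xs)) / n0
        mu[1] = sum(r * x for r, x in zip(resp, xs)) / n1
        var0 = sum((1 - r) * (x - mu[0]) ** 2 for r, x in zip(resp, xs)) / n0
        var1 = sum(r * (x - mu[1]) ** 2 for r, x in zip(resp, xs)) / n1
        sd = [max(math.sqrt(var0), 1e-6), max(math.sqrt(var1), 1e-6)]
        w = [n0 / n, n1 / n]
    return w[1], (mu[0], sd[0]), (mu[1], sd[1])


# Synthetic stand-in for per-window statistic values (not real sequence data):
# 700 "noncoding" windows and 300 "coding" windows.
random.seed(0)
stats = [random.gauss(0.0, 1.0) for _ in range(700)]
stats += [random.gauss(6.0, 1.0) for _ in range(300)]
coding_fraction, noncoding_fit, coding_fit = fit_two_normals(stats)
print(round(coding_fraction, 3), noncoding_fit, coding_fit)
```

In this sketch the mixing weight of the higher-mean component plays the role of the estimated coding density. With real data, the statistic would be computed on each window produced by windows, and the quality of the estimate would depend on how well the two-normal assumption holds for that statistic.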