Stochastic Models for Heterogeneous DNA Sequences
Gary A. Churchill
Bulletin of Mathematical Biology 51(1), 79--94 (1989)
Abstract
The composition of naturally occurring DNA sequences is often strikingly
heterogeneous. In this paper, the DNA sequence is viewed as a stochastic
process with local compositional properties determined by the states of
a hidden Markov chain. The model used is a discrete-state, discrete-outcome
version of a general model for non-stationary time series proposed by
Kitagawa (1987). A smoothing algorithm is described which can be used
to reconstruct the hidden process and produce graphic displays of the
compositional structure of a sequence. The problem of parameter
estimation is approached using likelihood methods, and an EM algorithm
for approximating the maximum likelihood estimate is derived. The
methods are applied to sequences from yeast mitochondrial DNA, human
and mouse mitochondrial DNAs, a human X chromosomal fragment and the
complete genome of bacteriophage lambda.
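
As a rough illustration of the smoothing idea summarized above, the following
sketch computes, for each position in a sequence, the probability of each
hidden state given the entire sequence. It is a standard scaled
forward-backward smoother for a discrete hidden Markov model, not necessarily
the recursion derived in the paper. The function name smooth, the two-state
AT-rich/GC-rich model, and all parameter values are hypothetical and chosen
only for illustration.

```python
def smooth(seq, init, trans, emit):
    """Return P(state at t | whole sequence) for every position t.

    seq   : string of observed bases
    init  : init[i] = P(first state is i)
    trans : trans[i][j] = P(next state is j | current state is i)
    emit  : emit[i][base] = P(base | state i)
    """
    n, k = len(seq), len(init)

    # Forward pass: filtered probabilities P(state_t | seq[0..t]),
    # normalized at each step to avoid numerical underflow.
    fwd, scales = [], []
    prev = None
    for t, base in enumerate(seq):
        if t == 0:
            pred = list(init)
        else:
            pred = [sum(prev[i] * trans[i][j] for i in range(k))
                    for j in range(k)]
        unnorm = [pred[j] * emit[j][base] for j in range(k)]
        c = sum(unnorm)
        prev = [u / c for u in unnorm]
        fwd.append(prev)
        scales.append(c)

    # Backward pass, using the same scaling constants as the forward pass.
    bwd = [[1.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        nxt = seq[t + 1]
        bwd[t] = [
            sum(trans[i][j] * emit[j][nxt] * bwd[t + 1][j] for j in range(k))
            / scales[t + 1]
            for i in range(k)
        ]

    # Combine both passes into smoothed state probabilities.
    post = []
    for t in range(n):
        w = [fwd[t][i] * bwd[t][i] for i in range(k)]
        s = sum(w)
        post.append([x / s for x in w])
    return post


# Hypothetical two-state model: state 0 is AT-rich, state 1 is GC-rich.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [
    {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
]
posterior = smooth("ATATATATGCGCGCGC", init, trans, emit)
```

Plotting posterior[t][0] against t would give the kind of graphic display of
compositional structure that the abstract mentions. In practice the
parameters would be estimated from the data, for example with an EM
algorithm, rather than fixed by hand as they are here.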