Analysis of DNA Sequences
B.S. Weir
Statistical Methods in Medical Research, 2(3), 225-239 (1993)
Abstract
Recent developments in the statistical analysis of DNA sequences are reviewed. The pace at which
sequence data are being generated and analysed has increased with the growth of the Human Genome Project. Two
areas of activity are emphasized: attention to error rates in recorded sequences, and heterogeneity in structure of
sequences. There is now empirical evidence suggesting error rates in the range 0.1%-1%, and such rates will
affect evolutionary studies because they are comparable to the rates at which DNA sequences from different
individuals are expected to differ. Heterogeneity in such quantities as base composition, or lengths between successive
subsequences of specified types, may be sufficient to account for observed long-range correlations between
bases. The need for statistical models and analyses of DNA sequence data will continue, and will offer interesting
challenges.
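
As a rough illustration of why error rates of this size matter, the sketch below computes the expected fraction of mismatched sites between two recorded sequences when each base may be misrecorded. The model (errors independent in each sequence, with the wrong base chosen uniformly from the other three), the function name observed_mismatch, and the true divergence of 0.1% are assumptions made for illustration only; they are not taken from the review.

```python
def observed_mismatch(d, e):
    """Expected per-site mismatch between two recorded sequences.

    d: true per-site probability that the two sequences differ (assumed)
    e: per-base recording error rate, assumed independent in each sequence,
       with the wrong base drawn uniformly from the other three bases
    """
    # True bases identical: the recorded bases still match if neither is in
    # error, or if both are in error and land on the same wrong base.
    match_if_same = (1 - e) ** 2 + e ** 2 / 3
    # True bases differ: the recorded bases match if exactly one is misread
    # as the other's base, or both are misread as the same third base.
    match_if_diff = 2 * e * (1 - e) / 3 + 2 * e ** 2 / 9
    return (1 - d) * (1 - match_if_same) + d * (1 - match_if_diff)


if __name__ == "__main__":
    true_divergence = 0.001  # illustrative value, not from the source
    for error_rate in (0.0, 0.001, 0.01):  # spans the 0.1%-1% range cited
        apparent = observed_mismatch(true_divergence, error_rate)
        print(f"error rate {error_rate:.3f}: apparent divergence {apparent:.5f}")
```

Under these assumptions the observed mismatch is approximately d + 2e for small d and e, so an error rate equal to the true divergence roughly triples the apparent divergence.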