Detection of New Genes in a Bacterial Genome Using Markov Models for
Three Gene Classes
Mark Borodovsky, James D. McIninch, Eugene V. Koonin, Kenneth E. Rudd, Claudine Médigue,
and Antoine Danchin
Nucleic Acids Research, 23(17):3554-3562 (Sept 11, 1995)
Abstract
We further investigated the statistical features of the three classes of Escherichia coli genes that
have been previously delineated by factorial correspondence analysis and dynamic clustering
methods. A phased Markov model for a nucleotide sequence of each gene class was developed and
employed for gene prediction using the GeneMark program. The protein-coding region prediction
accuracy was determined for class-specific Markov models of different orders when the programs
implementing these models were applied to gene sequences from the same or other classes. It is
shown that at least two training sets and two program versions derived for different classes of E. coli
genes are necessary in order to achieve a high accuracy of coding region prediction for
uncharacterized sequences. Some annotated E. coli genes from Class I and Class III are shown to be
spurious, whereas many open reading frames (ORFs) that have not been annotated in GenBank as
genes are predicted to encode proteins. The amino acid sequences of the putative products of these
ORFs initially did not show similarity to already known proteins. However, conserved regions have
been identified in several of them by screening the latest entries in protein sequence databases and
applying methods for motif search, while others among these new genes have been identified in
independent experiments.
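
The idea of a phased Markov model with class-specific training sets can be illustrated with a minimal sketch. This is not the GeneMark implementation: the function names, the model order of 2, the add-one pseudocounts, and the toy training sequences are all assumptions made here for illustration. The sketch estimates separate transition probabilities for each of the three codon positions and compares how well two class-specific models explain the same query sequence.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_phased_markov(seqs, order=2, pseudocount=1.0):
    """Estimate codon-position-specific transition probabilities.

    Each training sequence is assumed to start at codon position 0.
    Returns a function prob(phase, context, base).
    """
    # counts[phase][context][base]: one table per codon position (phase)
    counts = [defaultdict(lambda: defaultdict(float)) for _ in range(3)]
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[i % 3][seq[i - order:i]][seq[i]] += 1.0

    def prob(phase, context, base):
        row = counts[phase][context]
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        return (row[base] + pseudocount) / total

    return prob

def log_likelihood(seq, prob, order=2, frame=0):
    """Log-probability of seq (after its first `order` bases),
    assuming seq[0] sits at codon position `frame`."""
    return sum(
        math.log(prob((i + frame) % 3, seq[i - order:i], seq[i]))
        for i in range(order, len(seq))
    )

# Hypothetical toy "gene classes" (not real E. coli data).
model_a = train_phased_markov(["ATGGCTGCTGCTGCTGCTTAA"])
model_b = train_phased_markov(["ATGAAAAAAAAAAAAAAATAA"])

query = "GCTGCTGCTGCT"
score_a = log_likelihood(query, model_a)
score_b = log_likelihood(query, model_b)
print("better explained by class", "A" if score_a > score_b else "B")
```

In this toy setting the query is scored higher by the model trained on sequences of similar composition, and higher in the correct frame than in a shifted one, which mirrors in miniature why a model derived from one gene class may perform poorly on genes of another class.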