Intrinsic and Extrinsic Approaches for Detecting Genes in a Bacterial Genome
M. Borodovsky, K.E. Rudd, E.V. Koonin
Nucleic Acids Research, 22(22):4756-4767 (Nov 11, 1994)
Abstract
The unannotated regions of the Escherichia coli genome DNA sequence
from the EcoSeq6 database, totaling 1,278 'intergenic' sequences with
a combined length of 359,279 base pairs, were analyzed using
computer-assisted methods with the aim of identifying putative
unknown genes. The proposed strategy for finding new genes includes
two key elements: i) prediction of expressed open reading frames
(ORFs) using the GeneMark method based on Markov chain models for
coding and non-coding regions of Escherichia coli DNA, and ii)
search for protein sequence similarities using programs based on the
BLAST algorithm and programs for motif identification. A total of
354 putative expressed ORFs were predicted by GeneMark. Using the
BLASTX and TBLASTN programs, it was shown that 208 ORFs located in
the unannotated regions of the E. coli chromosome are significantly
similar to other protein sequences. Identification of 182 ORFs as
probable genes was supported by both GeneMark and BLAST, comprising
51.4% of the GeneMark 'hits' and 87.5% of the BLAST 'hits'. Seventy-three
putative new genes, comprising 20.6% of the GeneMark predictions, belong to
ancient conserved protein families that include both eubacterial and
eukaryotic members. This value is close to the overall proportion of
highly conserved sequences among eubacterial proteins, indicating
that the majority of the putative expressed ORFs that are predicted
by GeneMark, but have no significant BLAST hits, nevertheless are
likely to be real genes. The majority of the putative genes
identified by BLAST search have been described since the release of
the EcoSeq6 database, but about 70 genes have not been detected so
far. Among these new identifications are genes encoding proteins
with a variety of predicted functions including dehydrogenases,
kinases, several other metabolic enzymes, ATPases, rRNA
methyltransferases, membrane proteins, and different types of
regulatory proteins.
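
The percentages quoted in the abstract follow directly from the reported ORF counts. The short Python sketch below recomputes them as a consistency check. The variable and function names are illustrative and not taken from the paper, and the "GeneMark only" and "BLAST only" counts are simple differences derived here, assuming the 182 ORFs are exactly the overlap of the two prediction sets; the abstract does not state those two figures.

```python
# Counts quoted in the abstract (names are illustrative, not from the paper).
GENEMARK_ORFS = 354   # putative expressed ORFs predicted by GeneMark
BLAST_ORFS = 208      # ORFs with significant BLASTX/TBLASTN similarity
SUPPORTED_BY_BOTH = 182   # ORFs supported by both GeneMark and BLAST
ANCIENT_CONSERVED = 73    # ORFs in ancient conserved protein families


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Shares reported in the abstract.
both_of_genemark = percent(SUPPORTED_BY_BOTH, GENEMARK_ORFS)
both_of_blast = percent(SUPPORTED_BY_BOTH, BLAST_ORFS)
conserved_of_genemark = percent(ANCIENT_CONSERVED, GENEMARK_ORFS)

# Derived here under the overlap assumption; not stated in the abstract.
genemark_only = GENEMARK_ORFS - SUPPORTED_BY_BOTH
blast_only = BLAST_ORFS - SUPPORTED_BY_BOTH

print(f"Both methods, share of GeneMark hits: {both_of_genemark}%")
print(f"Both methods, share of BLAST hits:    {both_of_blast}%")
print(f"Conserved families, share of GeneMark predictions: {conserved_of_genemark}%")
print(f"GeneMark only: {genemark_only}, BLAST only: {blast_only}")
```

Under that assumption, the GeneMark-only count is the set of predictions that the abstract argues are nevertheless likely to be real genes.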