Gene Recognition Via Spliced Sequence Alignment
Mikhail S. Gelfand, Andrey A. Mironov, Pavel A. Pevzner
Proceedings of the National Academy of Sciences,
93(17): 9061-9066 (August 20, 1996)
Abstract
Gene recognition is one of the most important problems in computational molecular
biology. Previous attempts to solve this problem were based on statistics, and
applications of combinatorial methods for gene recognition were almost unexplored.
Recent advances in large-scale cDNA sequencing open a way toward a new approach to
gene recognition that uses previously sequenced genes as a clue for recognition of
newly sequenced genes. This paper describes a spliced alignment algorithm and
software tool that explores all possible exon assemblies in polynomial time and
finds the multiexon structure with the best fit to a related protein. Unlike other
existing methods, the algorithm successfully recognizes genes even in the case
of short exons or exons with unusual codon usage; we also report correct assemblies
for genes with more than 10 exons. On a test sample of human genes with known
mammalian relatives, the average correlation between the predicted and actual
proteins was 99%. The algorithm correctly reconstructed 87% of genes, and the rare
discrepancies between the predicted and real exon-intron structures were caused either
by short (fewer than 5 amino acids) initial/terminal exons or by alternative splicing.
Moreover, the algorithm predicts human genes reasonably well when the homologous protein
is nonvertebrate or even prokaryotic. The surprisingly good performance of the method
was confirmed by extensive simulations: in particular, with target proteins at
160 accepted point mutations (PAM) (25% similarity), the correlation between the
predicted and actual genes was still as high as 95%.
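
To make the idea of exploring all exon assemblies in polynomial time concrete, the following is a minimal sketch of a spliced-alignment dynamic program. It is not the authors' implementation. It assumes, for illustration only, that candidate exons are given as half-open, non-empty intervals (start, end) in the genomic sequence, that the target is a nucleotide string rather than a protein, and that scoring is a simple match/mismatch/linear-gap scheme. The function name spliced_alignment and all parameters are hypothetical.

```python
def spliced_alignment(genome, target, blocks, match=1, mismatch=-1, gap=-1):
    """Best chain of non-overlapping blocks whose concatenation aligns to target.

    Returns (score, chain), where chain is a tuple of (start, end) intervals.
    Illustrative assumptions: nucleotide target, linear gap penalty,
    blocks are non-empty half-open intervals in `genome`.
    """
    n = len(target)
    # Each DP cell is (score, chain). With no block used, aligning
    # target[:j] against nothing costs j gaps.
    empty = [(j * gap, ()) for j in range(n + 1)]
    done = []        # (block_end, final_row) for every processed block
    best = empty[n]  # fall back to the empty chain

    # Sorting by end guarantees every possible predecessor is processed first.
    for start, end in sorted(blocks, key=lambda b: (b[1], b[0])):
        # Entry row: per target prefix, the best chain that ends at or before `start`.
        row = list(empty)
        for prev_end, prev_row in done:
            if prev_end <= start:
                row = [max(a, b, key=lambda c: c[0]) for a, b in zip(row, prev_row)]
        # Extend every candidate chain with the current block.
        row = [(score, chain + ((start, end),)) for score, chain in row]

        # Standard global-alignment recurrence across the block's characters.
        for i in range(start, end):
            new = [(row[0][0] + gap, row[0][1])]
            for j in range(1, n + 1):
                sub = match if genome[i] == target[j - 1] else mismatch
                candidates = (
                    (row[j - 1][0] + sub, row[j - 1][1]),  # align genome[i] with target[j-1]
                    (row[j][0] + gap, row[j][1]),          # genome[i] against a gap
                    (new[j - 1][0] + gap, new[j - 1][1]),  # target[j-1] against a gap
                )
                new.append(max(candidates, key=lambda c: c[0]))
            row = new

        done.append((end, row))
        if row[n][0] > best[0]:
            best = row[n]
    return best


# Toy example: three true exons (ATG, GGA, TAA) separated by two decoy blocks.
genome = "ATGCCCCGGATTTTTAA"
target = "ATGGGATAA"
blocks = [(0, 3), (3, 7), (7, 10), (10, 14), (14, 17)]
score, chain = spliced_alignment(genome, target, blocks)
```

With B candidate blocks of total length L and a target of length n, this sketch runs in O(L·n + B²·n) time, even though the number of possible block chains can be exponential in B. In the toy example the best chain is the three true exons, with a score of 9.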