About
This website was created in 1996 to provide a collection of
publications on computational gene recognition, mainly
based on statistical differences between coding and non-coding
sequences. With the complete sequences of many genomes available,
including the human genome sequence, computational gene recognition
has become routine, and much non-statistical (i.e. biological)
information is used for such prediction. It is more and more
difficult to decide which topics and which papers should be
included and are of general interest. The first expansion of
the topic was to include promoter and regulatory region
recognition, which was already a deviation from the original
plan to include only coding-noncoding recognition papers.
A second expansion of the topic was the use of microarray
technology to monitor the expression of genes. A third expansion
was gene annotation, which is a compilation of genes with
their classifications and functions. And a fourth expansion
was alternative splicing.
Let's summarize the original coding-noncoding distinction topic.
How can such discrimination be made?
- We ask where the genes are unlikely to be located.
[this is the strategy of excluding inter-genic regions]
- We ask how transcription factors know where to bind
on the DNA. [consensus patterns in promoter region, CpG islands, etc.]
- We ask how transcription, splicing, and translation find
their respective signals on the DNA sequence.
[searching for the start codon, the stop codon, splice sites, etc.]
[be aware of the non-universal genetic code:
http://www.ncbi.nlm.nih.gov/htbin-post/Taxonomy/wprintgc?mode=c]
- We ask what coding regions do "to make a living"?
They make proteins! Since three nucleotides code for one
amino acid, there is a distinct periodicity-three pattern
in coding regions (which is absent in noncoding regions).
[this is the strategy of detecting the periodicity-of-three
signal, which accounts for much of the earlier development in this field]
- We ask whether we can learn from examples, especially
if these examples are obtained from the same species. [the
codon usage strategy follows this direction]
- We ask whether we can "cheat" by looking up a "dictionary"
(a database of known coding sequences). [this is sequence
similarity search]
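The signal-search strategy in the list above can be sketched as a simple open reading frame (ORF) scan. This is only an illustration under stated assumptions: the names `find_orfs`, `STANDARD_STOPS` and `VERT_MITO_STOPS` are ours, only the forward strand is scanned, and ATG is the only start codon considered. The stop codons are a parameter because the genetic code is not universal.

```python
# Stop codons of the standard genetic code.
STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})
# Vertebrate mitochondrial code (NCBI translation table 2):
# TGA codes for Trp, while AGA and AGG act as stops.
VERT_MITO_STOPS = frozenset({"TAA", "TAG", "AGA", "AGG"})


def find_orfs(seq, stops=STANDARD_STOPS, start="ATG", min_codons=1):
    """Return sorted (begin, end) pairs of forward-strand ORFs.

    `begin` is the index of the first base of the start codon and
    `end` is one past the last base of the stop codon.
    `min_codons` counts codons from the start codon up to, but not
    including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        orf_start = None  # index of the first in-frame start codon seen
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None:
                if codon == start:
                    orf_start = i
            elif codon in stops:
                if (i - orf_start) // 3 >= min_codons:
                    orfs.append((orf_start, i + 3))
                orf_start = None  # look for the next start codon
    return sorted(orfs)


# The same sequence gives different ORFs under different genetic codes.
print(find_orfs("ATGTGAAAATAA"))
print(find_orfs("ATGTGAAAATAA", stops=VERT_MITO_STOPS))
```

Real gene finders also scan the reverse strand and handle introns, which this sketch ignores.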
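The periodicity-of-three strategy in the list above is often measured with a Fourier transform of the sequence. Here is a minimal sketch, assuming the common approach of turning each nucleotide into a 0/1 indicator sequence and summing the spectral power at frequency 1/3. The function name `period3_power` and the normalization by sequence length are our choices for illustration, not taken from any particular paper in this collection.

```python
import cmath


def period3_power(seq):
    """Spectral power at frequency 1/3, summed over A, C, G, T.

    For each base, the indicator sequence (1 where the base occurs,
    0 elsewhere) is projected onto exp(-2*pi*i*n/3). The squared
    magnitudes are summed and divided by the sequence length.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    # The three possible phase factors, indexed by position modulo 3.
    phase = [cmath.exp(-2j * cmath.pi * k / 3) for k in range(3)]
    power = 0.0
    for base in "ACGT":
        total = sum(phase[i % 3] for i, ch in enumerate(seq) if ch == base)
        power += abs(total) ** 2
    return power / n


# A perfectly periodic toy sequence versus a homopolymer of the same length.
print(period3_power("ATG" * 50))
print(period3_power("A" * 150))
```

In practice the score is computed in a sliding window along the sequence, and a window with a high value is taken as a candidate coding region.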
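The learn-from-examples (codon usage) strategy in the list above can be sketched as follows: estimate codon frequencies from known coding sequences of one species, then score a candidate region by its average log-odds against a uniform background. The names `train_codon_model` and `coding_score`, the pseudocount, and the uniform 1/64 background are all assumptions made for this illustration.

```python
import math
from collections import Counter

ALL_CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]


def codons(seq):
    """Split a sequence into non-overlapping codons, starting at position 0."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def train_codon_model(coding_seqs, pseudocount=1.0):
    """Estimate codon frequencies from in-frame example coding sequences."""
    counts = Counter()
    for s in coding_seqs:
        counts.update(codons(s))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def coding_score(seq, model):
    """Average log-odds per codon against a uniform 1/64 background.

    Codons with ambiguous letters (e.g. N) are skipped.
    """
    usable = [c for c in codons(seq) if c in model]
    if not usable:
        return 0.0
    return sum(math.log(model[c] * 64) for c in usable) / len(usable)


# Toy training set: a "species" that uses only GCT and GAA.
model = train_codon_model(["GCTGAAGCTGAA" * 10])
print(coding_score("GCTGAAGCT", model) > coding_score("TTTCCCGGG", model))
```

A positive score means the region uses codons the way the training examples do. Scoring all three reading frames and comparing them is a natural next step.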
More challenges:
- The first and the last exon may contain untranslated
regions (so-called 5' UTR and 3' UTR, or UTS where S stands
for sequence) and the coding signal there can be very weak.
- Small exons are harder to predict.
- Since alternative splicing in the human genome is not
uncommon, gene recognition may not have a unique
solution.
If you see these labels, they (roughly) mean...
- promoter recognition or, more generally, recognition of any regulatory element
- using expression data (e.g. microarray) to study gene regulation
- gene annotation of complete genomes
- translation regulation, untranslated sequences, translation start/termination
- using sequence similarity (homology) for gene recognition, comparative genome analysis
- splicing (but I keep "alternative splicing" under a separate label)
- alternative splicing