Building a Dictionary for Genomes: Identification of
Presumptive Regulatory Sites by Statistical Analysis
Harmen J Bussemaker*, Hao Li**, and Eric D Siggia
Center for Studies in Physics and Biology, The Rockefeller University, Box
25, 1230 York Avenue, New York, NY 10021
* Present address: Swammerdam Institute for Life Sciences and
Amsterdam Center for Computational Science, University of Amsterdam,
Kruislaan 318, 1098 SM Amsterdam, The Netherlands.
E-mail: bussemaker@bio.uva.nl,
haoli@haoli1.ucsf.edu, or
siggia@eds1.rockefeller.edu.
** Present address: Departments of Biochemistry and Biophysics, University
of California, San Francisco, CA 94143.
Proceedings of the National Academy of Sciences,
97(18):10096-10100 (2000).
Abstract
The availability of complete genome sequences and mRNA expression data for all genes
creates new opportunities and challenges for identifying DNA sequence motifs that
control gene expression. An algorithm, "MobyDick," is presented that decomposes a set
of DNA sequences into the most probable dictionary of motifs or words. This method is
applicable to any set of DNA sequences: for example, all upstream regions in a genome or
all genes expressed under certain conditions. Identification of words is based on a probabilistic segmentation
model in which the significance of longer words is deduced from the frequency of shorter ones of various
lengths, eliminating the need for a separate set of reference data to define probabilities. We have built a
dictionary with 1,200 words for the 6,000 upstream regulatory regions in the yeast genome; the 500 most
significant words (some with as few as 10 copies in all of the upstream regions) match 114 of 443 experimentally
determined sites (a significance level of 18 standard deviations). When analyzing all of the genes up-regulated
during sporulation as a group, we find many motifs in addition to the few previously identified by analyzing the
subclusters individually. Applying MobyDick to the genes derepressed when the
general repressor Tup1 is deleted, we find known as well as putative binding sites for its regulatory partners.
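
To illustrate the probabilistic segmentation model described above, the following minimal sketch computes the likelihood of a sequence as the sum, over every way of cutting it into dictionary words, of the product of the word probabilities. It is an illustration under stated assumptions, not the MobyDick implementation: the function name `sequence_likelihood`, the toy dictionary, and its probabilities are all hypothetical, and the fitting of word probabilities and the statistical tests for adding new words are omitted.

```python
def sequence_likelihood(seq, word_probs):
    """Sum over all segmentations of seq into dictionary words of the
    product of word probabilities (hypothetical helper, for illustration).

    z[i] holds the total weight of all segmentations of the prefix seq[:i].
    """
    z = [0.0] * (len(seq) + 1)
    z[0] = 1.0  # the empty prefix has exactly one (empty) segmentation
    for i in range(1, len(seq) + 1):
        for word, p in word_probs.items():
            n = len(word)
            # A segmentation of seq[:i] may end with `word` if it matches there.
            if n <= i and seq[i - n:i] == word:
                z[i] += p * z[i - n]
    return z[len(seq)]


# Toy dictionary (assumed values): four single letters plus one longer word.
toy_dictionary = {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.2, "TATA": 0.2}

# "TATA" can be read as four single letters or as one four-letter word,
# so both segmentations contribute to the likelihood.
likelihood = sequence_likelihood("TATA", toy_dictionary)
```

In this toy setting the single word "TATA" accounts for nearly all of the likelihood, which is the sense in which a longer word is judged against the frequencies of the shorter words that could otherwise spell it.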