The prediction of vertebrate promoter regions using differential hexamer frequency analysis

G.B. Hutchinson
Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada
Correspondence address: c/o RabbitHutch Biotechnology Corporation, PO Box 506, 108 Mile Ranch, British Columbia, V0K 2Z0, Canada

Computer Applications in the Biosciences, 12(5), 391-398 (Oct 1996)

Abstract

Motivation. To develop an algorithm that uses differential hexamer frequency analysis to discriminate promoter from non-promoter regions in vertebrate DNA sequence, without relying upon an extensive database of known transcriptional elements.

Results. By combining hexamer frequencies derived from known promoter regions, coding regions and non-coding regions in vertebrate DNA sequence with a formula first applied by Claverie and Bougueleret (1986), a discriminant measure was created that compares promoter regions with coding (D1) and non-coding (D2) sequence. The algorithm correctly identifies the promoter regions in 18 of 29 loci (62.1%) from an independent test data set. With program options set to identify only one promoter region in the forward strand, there are 11 false positives in 18 974 single-stranded bp. With options set to analyze sequence in discrete segments, there is no appreciable improvement in sensitivity, whereas the specificity falls off predictably. It is of particular interest that a search for a peak score (independent of an absolute threshold) is more accurate than a search based upon a fixed scoring threshold. This suggests that the selection of promoter sites may be influenced by the global properties of an entire sequence domain, rather than exclusively by local characteristics.
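The general idea of a differential hexamer score, and the difference between a peak search and a fixed-threshold search, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not give the exact formula, so the ratio f_a / (f_a + f_b) used here is an assumed form, and all function names, the window averaging and the neutral value of 0.5 for unseen hexamers are hypothetical choices, not those of PromFind.

```python
from collections import Counter

K = 6  # hexamer length


def hexamer_freqs(seqs):
    """Relative frequency of each overlapping hexamer in a list of sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - K + 1):
            counts[s[i:i + K]] += 1
    total = sum(counts.values())
    return {h: c / total for h, c in counts.items()} if total else {}


def differential(f_a, f_b, hexamer):
    """ASSUMED form of the differential measure: f_a / (f_a + f_b).

    Hexamers seen in neither training set get a neutral 0.5 (an assumption).
    """
    a = f_a.get(hexamer, 0.0)
    b = f_b.get(hexamer, 0.0)
    return a / (a + b) if a + b > 0 else 0.5


def window_scores(seq, f_prom, f_other, window):
    """Mean differential score of the hexamers in each window of `seq`."""
    seq = seq.upper()
    per_pos = [differential(f_prom, f_other, seq[i:i + K])
               for i in range(len(seq) - K + 1)]
    n = window - K + 1  # number of hexamers that fit in one window
    return [sum(per_pos[i:i + n]) / n for i in range(len(per_pos) - n + 1)]


def peak_window(scores):
    """Peak search: start index of the single best-scoring window."""
    return max(range(len(scores)), key=scores.__getitem__)


def windows_above(scores, threshold):
    """Threshold search: start indices of all windows scoring >= threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]
```

In this sketch, a score analogous to D1 would be obtained by passing coding-region frequencies as `f_other`, and one analogous to D2 by passing non-coding frequencies. The peak search always returns exactly one window per sequence, whereas the threshold search may return none or many, which is one way to read the observation above that the peak criterion depends on the sequence as a whole rather than on an absolute cutoff.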

Availability. A binary-executable, MS-DOS version of PromFind is available free of charge by anonymous ftp, address: iubio.bio.indiana.edu, directory: molbio/ibmpc.

Contact. E-mail: hutch@netshop.bc.ca