Some useful statistical properties of position-weight matrices
Claverie JM
Computers & Chemistry 18(3):287-294 (1994 Sep)
Abstract
Position-weight matrices (or profiles) are simple mathematical
objects traditionally used to capture the information about local
sequence patterns (or motifs) characteristic of a given structure or
function. Although weight matrices can lead to fast database
scanning algorithms, their usage has been limited, due to the lack of
a reliable method to assess the statistical significance of the
matching scores. In this article I first review three different
computation schemes for designing weight matrices from a
block-alignment of any (small or large) number of sequences. I then
show that, for patterns spanning 10 positions or more, the best
scores expected from matching random sequences are distributed
according to the extreme value (Gumbel) distribution. The threshold
of statistical significance assessed from this distribution
perfectly delineates the range of scores characterizing "true
positive" sequences (biologically significant matches). This result
allows weight matrices to be used to scan an entire protein database
for patterns in a highly sensitive way. MODEST (MOtif DEsign and
Search Tools), a suite of programs in Unix/C, implements these
statistical improvements and is available upon E-mail request
(jmc@ncbi.nlm.nih.gov).
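
The workflow summarized in the abstract (build a weight matrix from a block alignment, score random sequences, fit an extreme value distribution to the best scores, derive a significance threshold) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the article or from MODEST: the toy block, the function names, the uniform background frequencies, the add-one pseudocount log-odds scheme (not necessarily one of the three schemes reviewed), and the method-of-moments Gumbel fit.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pwm(block, pseudocount=1.0):
    """Log-odds weight matrix from equal-length aligned sequences.

    Assumes a uniform background and a simple additive pseudocount.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pwm = []
    for pos in range(len(block[0])):
        column = [seq[pos] for seq in block]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pwm.append({
            aa: math.log(((column.count(aa) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pwm

def best_score(pwm, sequence):
    """Highest matrix score over all windows of the sequence."""
    width = len(pwm)
    return max(
        sum(pwm[i][sequence[start + i]] for i in range(width))
        for start in range(len(sequence) - width + 1)
    )

def fit_gumbel(scores):
    """Method-of-moments estimates of the Gumbel location (mu) and scale (beta)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    beta = math.sqrt(6.0 * var) / math.pi
    mu = mean - 0.5772156649 * beta  # Euler-Mascheroni constant
    return mu, beta

def gumbel_threshold(mu, beta, p_value):
    """Score x with P(best score >= x) = p_value under Gumbel(mu, beta)."""
    return mu - beta * math.log(-math.log(1.0 - p_value))

# Hypothetical toy block: 5 aligned sequences, 12 positions each.
block = [
    "GAGKSTLLNALA",
    "GSGKSTLLRALA",
    "GAGKTTLLNAIA",
    "GPGKSSLLNVLA",
    "GAGKSTLMNALS",
]
pwm = build_pwm(block)

# Best scores obtained from random sequences (uniform composition assumed).
rng = random.Random(0)
random_best = [
    best_score(pwm, "".join(rng.choice(AMINO_ACIDS) for _ in range(300)))
    for _ in range(500)
]

mu, beta = fit_gumbel(random_best)
threshold = gumbel_threshold(mu, beta, p_value=0.01)

# A sequence that embeds a block member should score above the threshold.
query = "MKT" + block[0] + "QRS"
print(f"threshold={threshold:.2f}  query best score={best_score(pwm, query):.2f}")
```

In a real database scan, the background frequencies and the random-sequence model would normally be derived from the database composition rather than assumed uniform, and the Gumbel parameters could be estimated by maximum likelihood instead of moments.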