Annoucement of a new version of DISEQ

(feb 22, 1996)


UPDATE ANNOUNCEMENT FOR LINKAGE DISEQUILIBRIUM ANALYSIS PROGRAMS

A new version of the programs DISLAMB and DISMULT for likelihood based linkage disequilibrium analysis is now available via ftp.

NEW PROGRAM VERSION 2-22-96 Updates include

  1. Improved maximization routine included after bug report from Michael Knapp

  2. Floating Point arithmetics streamlined following programming suggestions from Alejandro Schaffer

  3. Haplotype Relative Risk analysis programs included for disequilibrium analysis using LINKAGE pedigree files.

  4. Support Intervals for Lambda and Theta are now computed by DISLAMB.

  5. Cross system compatibility improved.

  6. Executables now available with statically linked libraries for SUNOS, SOLARIS, and OSF, so no PASCAL compiler is required - VMS version, however does require a PASCAL compiler installed on your machine.

If there are questions, refer them to

Joseph D. Terwilliger
Wellcome Trust Center for Human Genetics
Windmill Road
Headington OX3 7BN
Oxford, UK

FAX +44 - 1865 - 742 196
Ph. +44 - 1865 - 740 016
EMAIL joe.terwilliger@well.ox.ac.uk

OR

Joseph D. Terwilliger
Columbia University, Unit 58
722 West 168th Street
New York, NY 10032
USA

FAX +1 - 212 - 568-2750
Ph. +1 - 212 - 960-5863
EMAIL jdt3@columbia.edu

To get the programs and documentation for DEC and SUN machines under either UNIX or VMS, just do the following:

From OXFORD:

  1. ftp ftp.well.ox.ac.uk
  2. login as anonymous with your EMAIL address as the password
    (please leave your EMAIL address so I have a record of who has downloaded the programs)
  3. cd pub/genetics/diseq
  4. binary
  5. get FILENAME.tar.Z
  6. close
  7. quit

From NEW YORK

  1. ftp linkage.cpmc.columbia.edu
  2. login as anonymous with your EMAIL address as the password
    (please leave your EMAIL address so I have a record of who has downloaded the programs)
  3. cd software/diseq
  4. binary
  5. get FILENAME.tar.Z
  6. close
  7. quit

Then, on your own machine

uncompress FILENAME.tar.Z
tar xvf FILENAME.tar

There are a number of files available, as follows:

SYSTEM_diseq.tar.Z = source, executables, and documentation
SYSTEM_diseq_source.tar.Z = source and documentation

Files are available for the following SYSTEMs - OSF,SOLARIS, SUNOS, VMS. The SUNOS_diseq_source.tar.Z contains a fairly standard UNIX version of the code which could be recompiled with minimal or no changes (involving file handling if anything for some systems...) on your different UNIX system. If you download only the source files, then they will be in a directory SYSTEM_diseq/source. You shouls first create a directory SYSTEM_diseq/exec, and then to compile the programs, go to the SYSTEM_diseq/source directory and run the included shell script "compilethem" which will compile the programs and put the executables in the directory SYSTEM_diseq/exec which you just created.

There are seven programs included

  1. DISLAMB for single locus linkage disequilibrium analysis for a multiallelic system
  2. DISMULT for multilocus linkage disequilibrium analysis
  3. TDTLIKE for multiallelic likelihood-based version of the TDT test for linkage in pedigree material.
  4. MCNMULT for multilocus-multiallelic likelihood-based version of the McNemar test for linkage disequilibrium in pedigree material.
  5. HRRPREP - takes LINKAGE format pedigree file pedin.dat and LINKAGE format parameter file datain.dat, and from each pedigree selects one proband by looking for the first affected person in the pedigree typed at the first marker locus with parents also typed at that locus, and take his alleles as the case sample, and the nontransmitted alleles from the parents as the control sample (cf. Falk and Rubinstein, 1987; Terwilliger and Ott, 1992).
  6. HRRLAMB - same as DISLAMB but taking the data from diseqin.dat created by HRRPREP - note - this program only analyzes the first marker locus in your pedin.dat file.
  7. HRRMULT - same as DISMULT but takes the data from the diseqin.dat created by HRRPREP (note that the recombination fractions between markers are assumed to have been given in the parameter file, datain.dat.

DISLAMB and DISMULT are discribed in the documentation file diseq.doc, and TDTLIKE and MCNMULT are described in the file tdtlike.doc All these programs are based on algorithms largely presented in the following papers which should be cited when using them: HRRMULT and HRRLAMB are the same algorithm as DISLAMB and DISMULT, but the sampling of cases and control is done by HRRPREP according to the method outlined in Terwilliger and Ott, 1992.

FOR ALL OF THESE PROGRAMS CITE:

Terwilliger, JD (1995) " A Powerful Likelihood Method for the Analysis of Linkage Disequilibrium Between Trait Loci and One or More Polymorphic Marker Loci" Am J Hum Genet 56:777-787.

Terwilliger, JD (1996) "Response to Sham et al." Am J Hum Genet (In Press - May 1996).

FOR TDTLIKE or MCNMULT, quote the above articles plus

Terwilliger, JD (1996) " Pedigree-based Likelihood Methods for Analysis of Linkage and Linkage Disequilibrium with One or More Marker Loci" (Submitted)

FOR HRRPREP, HRRMULT, and HRRLAMB quote the first two articles plus

Falk CT, and Rubinstein P (1987) "Haplotype relative risks: an easy reliable way to construct a proper control sample for risk calculations" Ann Human Genet 51:227-233.

Terwilliger JD and Ott J (1992) "A haplotype-based haplotype relative risk statistic" Human Heredity 42:337-346.


diseq.doc


Likelihood - Based Disequilibrium Analysis Programs (2/21/96)

UPDATE - 21 February 1996:

Support interval calculations added for lambda and theta in dismult. Maximization routines improved in DISMULT and DISLAMB Haplotype Relative Risk extensions included to read LINKAGE files.

I wish to acknowledge some assistance in detecting a bug in the maximization routine discovered by Michael Knapp, and some programming advice from Alejandro Shaffer.

The algorithms used in these programs are described in detail in the following paper (which should be cited whenever these programs are used):

Terwilliger, JD "A Powerful Likelihood Method for the Analysis of Linkage Disequilibrium Between Trait Loci and One or More Polymorphic Marker Loci" American Journal of Human Genetics 56:777-787 (1995).

Terwilliger, JD "Response to Sham et al." Am J Hum Genet (1996) (In Press).


In this document I will only discuss technical issues regarding their use. Versions are included for either DEC machines or SUN machines. Where other UNIX versions differ, you may need to modify some of the file handling routines on your own. Compiled code is provided with statically linked libraries in case you have no Pascal compiler on your system for OSF, SUNOS, and SOLARIS.

In SUN Pascal, you should compile as follows:

pc -L -O4 -o dislamb dislamb.p
pc -L -O4 -o dismult dismult.p

and for DEC Pascal versions, under OSF it is recommended to compile with

pc -O4 -C all -o dislamb dislamb.p
pc -O4 -C all -o dismult dismult.p

And under VMS, just compile and link with

pas dislamb/nocheck
link dislamb/notraceback
pas dismult/nocheck
link dismult/notraceback

Then everything should work. If you still have trouble, contact me by sending EMAIL to joe@well.ox.ac.uk


DISLAMB is to be used for single locus association analysis. Just enter the data interactively - if the instructions are confusing let me know and I will try to make it more clear.

Consider the following dataset, using an assumption that the disease allele frequency is 0.001 (so that case sample provides little information for estimation of population allele frequencies), and using the nonparametric model (this is in general recommended...). This locus has 13 alleles, as follows:

CASE    10 48 12 1 3 17 2 5 1 2 3 4 1
CONTROL 8 8 10 3 3 15 2 7 3 4 2 5 2

Since the result was significant, I allowed it to estimate the recombination fraction between marker and disease under different assumptions about alpha and n - note that these estimates are very much dependent on your assumptions, and that your population is in equilibrium, so do not place to much faith in these estimates of map distance, as mentioned in the paper - support intervals are given (normally I would use the same stringency as the test in constructing support intervals for lambda - typically 3-lod-units. It is important to note that the support intervals for theta given do not take into account the userinput age of mutation in this population. To construct a support interval allowing for this, you must separately compute support intervals for each combination of age and heterogeneity parameter you wish to admit.

*************************************************************
*                                                           *
*       Program DISLAMB - Version 2.1   (2/3/96)            *
*                                                           *
*       AJHG 56:777-787 (1995)                              *
*                                                           *
*************************************************************


Disease allele frequency = 0.00100000

=========================================================================================
CASE      | 10. | 48. | 12. |  1. |  3. | 17. |  2. |  5. |  1. |  2. |  3. |  4. |  1. |
CONTROL   |  8. |  8. | 10. |  3. |  3. | 15. |  2. |  7. |  3. |  4. |  2. |  5. |  2. |
=========================================================================================

Estimated parameters for likelihood ratio test:
 Allele frequencies:
    Allele            H0:             H1
         1     0.09944751     0.12594046
         2     0.30939227     0.12355539
         3     0.12154696     0.15425412
         4     0.02209945     0.02805155
         5     0.03314917     0.04207248
         6     0.17679558     0.22436222
         7     0.02209945     0.02804928
         8     0.06629834     0.08413984
         9     0.02209945     0.02804972
        10     0.03314917     0.04207317
        11     0.02762431     0.03530303
        12     0.04972376     0.06311307
        13     0.01657459     0.02103565
    Lambda     0.00000000       0.361496

LRT Chi-Square =        19.87098 p-value =    0.000004200154388 Lambda =   0.361496
SIGNIFICANT EVIDENCE OF LINKAGE DISEQUILIBRIUM BY LRT TEST
2 x n table Chi-square =   26.27953 P-value =    0.009797348469226
NO SIGNIFICANT EVIDENCE OF LINKAGE DISEQUILIBRIUM BY 2 x N TABLE CHI-SQUARE TEST

3-lod-unit support interval for lambda-hat : (0.09,0.58)

Alpha = 0.70   N = 100.   Theta-hat = 0.006587   Support Interval = (0.001853,0.020142)


In this output file, you first get the results for the likelihood-based association test, which in this case is significant, with a p-value of 0.0000042 (significant because p < 0.0001). The estimate of lambda is 0.3615, which is evident of a not very strong association, even though the p-value is quite low. This is an important distinction, since the p-value tells you whether or not there is an association, while lambda serves as a measure of how strong the association is! The program also performd a more standard chi-square test of independence on the 2 x 13 table one could use to compare case and control allele frequency distributions. Here, that is not significant, having a p-value of only 0.0098 (it is assumed here that you have not proven the existence of a gene through linkage, which normally requires a p-value of 0.0001 (= Lod Score of 3), so you should have the same stringency in your association test - if not more, since there is more potential to have a multiple testing problem with association studies...). Afterwards, you can see how the value of lambda = 0.3615 can be translated into recombination fraction under certain assumptions about the age of this mutation in this population, and its degree of homogeneity.


DISMULT does the multipoint analysis of association. The algorithms are explained in the aforementioned paper, and to use it requires an input file called diseqin.dat, which is described below. Again it is not recommended to use the genotype data, but rather allele data, since it is more robust and straightforward. If you have detailed questions, send me a message.

Program DISMULT does an association analysis for multiple loci jointly, assuming a fixed map of the marker loci in either cM or kb (with some assumption about the relationship between cM and kb, since disequilibrium is a function of genetic, not physical distance.

The input file should be called DISEQIN.DAT

  1. On the first line, enter the Disease Allele frequency (if this is smaller than 0.01, it makes little difference what it is as it is used simply to estimate marker allele frequencies).

  2. On the next line enter the number of markers

  3. on the next line, enter your assumption about the number of centiMorgans/Megabase (assuming the distances are to be entered in kilobases. If you are entering the data in centiMorgans, then the value of this variable should be 1000 (since the distances are assumed to be entered in kb...) If this number is too small, n will be very large, and if this is too large, the estimates of n will be very small.

  4. On one line Enter the distances between markers in kb (or in cM if you have entered 1000 as the conversion factor in the preceding line). D(1,2) D(2,3) D(3,4) ... D(n-1,n)

  5. On the next line, enter the number of analysis points per intermarker interval (must be <= constant maxpos), and usually 5 is sufficient. This is analogous to number of evaluations in LINKMAP program of LINKAGE package. (It should be noted that in the outer intervals, 10 steps are taken over 10cM out from the map flanking markers)

  6. Minimum Value for n - the decay parameter
    usually 10 or 20 is best. If you find the maximum likelihood at a boundary value of n, try either lowering it, or making the markers more tightly linked (i.e. alter the assumption about cM/Mb). Note that for points outside the map, this will typically be at or near the boundary, since disequilibrium is found a large distance away, i.e. at (presumably) the first marker locus or later...

  7. Maximum Value for n - the decay parameter
    usually 250 - 500 is fine. If you hit a boundary value, raise this no. or make the markers more widely spaced... When this value is high, there is a rapid decay in the association as you move small distances from the point you are evaluating the likelihood at.

  8. For each marker enter the data in the following format for allele data
    NUMBER OF ALLELES AT MARKER
    CASE(1) CONTROL(1)
    CASE(2) CONTROL(2)
    CASE(3) CONTROL(3)
      ...
    
    
  9. Name the file DISEQIN.DAT, save it, and run the program DISMULT, Output is in a file called diseq.out

Example DISEQIN.DAT

0.022    Disease allele frequency
3        3 Markers
1        cM/Mb
500  120 Physical Distance between Markers in Kb
5        5 Points per Interval
20       Minimum n
800      Maximum n
2        Number of alleles at locus 1
48 28    Case(1) Control(1)
58 59
3        Locus 2 
70 10
6 44
10 40
5        Locus 3 
73 53
69 35
1 10
2 14
3 31

Output file DISEQ.OUT from the above problem:

To make a graph just plot the Map position against the "Lod Score" or -2ln(LR) which is interpreted conservatively as 0.5* Chi-square(1df). Quantities like alpha and n are the vertical and horizontal control parameters described in the paper, and the interval and position are given so that you can see where each marker is (i.e. everytime you have position 0, you have a new marker locus, the number of which is given in the Interval column).

*************************************************************
*       Program DISMULT - Version 2.1   (2/5/96)            *
*       AJHG 56:777-787 (1995)                              *
*       Multilocus Linkage Disequilibrium Analysis          *
*************************************************************


LOCUS 1

==============|=====|=====|
    ALLELE:   |  1  |  2  |
--------------|-----|-----|
    CASE:     | 48  | 58  |
    CONTROL:  | 28  | 59  |
==============|=====|=====|


LOCUS 2

==============|=====|=====|=====|
    ALLELE:   |  1  |  2  |  3  |
--------------|-----|-----|-----|
    CASE:     | 70  |  6  | 10  |
    CONTROL:  | 10  | 44  | 40  |
==============|=====|=====|=====|


LOCUS 3

==============|=====|=====|=====|=====|=====|
    ALLELE:   |  1  |  2  |  3  |  4  |  5  |
--------------|-----|-----|-----|-----|-----|
    CASE:     | 73  | 69  |  1  |  2  |  3  |
    CONTROL:  | 53  | 35  | 10  | 14  | 31  |
==============|=====|=====|=====|=====|=====|




     Alpha     N   Inter.  Position   Map Position      Lod Score       -2ln(LR)
  1.000000    390.      0         0       -4.25866        0.00000        0.00000
  1.000000     20.      0         1       -0.10034        6.07393       27.97147
  1.000000     20.      0         2       -0.08935        7.45386       34.32631
  1.000000     20.      0         3       -0.07859        8.97969       41.35302
  1.000000     20.      0         4       -0.06807       10.57762       48.71173
  1.000000     20.      0         5       -0.05776       12.08476       55.65239
  1.000000     20.      0         6       -0.04766       13.16856       60.64346
  0.916548     20.      0         7       -0.03775       13.36366       61.54191
  0.760124     20.      0         8       -0.02804       13.35679       61.51031
  0.631592     20.      0         9       -0.01852       13.35003       61.47918
  0.525249     20.      0        10       -0.00917       13.34339       61.44857
  0.437886     20.      1         0        0.00000       13.33684       61.41844
  0.439703     20.      1         1        0.00100       13.65798       62.89733
  0.445691     28.      1         2        0.00199       13.95389       64.26002
  0.999995    314.      1         3        0.00299       18.45487       84.98783
  0.999998    426.      1         4        0.00400       22.16551      102.07595
  0.782908    800.      1         5        0.00500       23.83323      109.75606
  0.782976    800.      2         0        0.00500       23.83322      109.75605
  0.883834    800.      2         1        0.00524       22.65050      104.30941
  0.959204    800.      2         2        0.00548       18.32883       84.40738
  0.521664    142.      2         3        0.00572       15.80475       72.78358
  0.485274     89.      2         4        0.00596       15.11418       69.60335
  0.452230     59.      2         5        0.00620       14.68567       67.63001
  0.452230     59.      3         0        0.00620       14.68567       67.63001
  0.821114     62.      3         1        0.01537       14.69946       67.69351
  0.999994     44.      3         2        0.02472       14.63657       67.40389
  0.999995     30.      3         3        0.03424       14.50510       66.79845
  0.999998     23.      3         4        0.04395       14.40457       66.33549
  1.000000     20.      3         5        0.05386       14.17518       65.27912
  1.000000     20.      3         6        0.06396       13.02926       60.00196
  1.000000     20.      3         7        0.07427       11.41072       52.54832
  1.000000     20.      3         8        0.08479        9.69005       44.62431
  1.000000     20.      3         9        0.09555        8.04766       37.06086
  1.000000     20.      3        10        0.10654        6.56494       30.23266


tdtlike.doc


TDTLIKE is a program which will compute the TDT statistic for linkage in family data, and its multiallelic likelihood based counterpart.

The TDT statistic looks at all parents of affected children who are heterozygous for a specific marker allele. Then it compares how often that specific marker allele is transmitted to the affected offspring from such heterozygous parents. If there is no linkage, then 50% of the time the H allele would be transmitted and 50% of the time the other allele would be transmitted. If there is linkage AND association, then the H allele would most likely be on the chromosome with the D allele, and they would be inherited together more than 50% of the time. Thus, a simple test of linkage between the two loci can be performed. Note that you MUST have BOTH association AND linkage to expect anything other than 50% transmission of allele H, if this is performed on a sample of singleton affecteds - if there are a small number of large pedigrees, a positive result does NOT mean there must be association.

The form of the statistic is very simple.

                       2
                (A - B)
             -------------
                (A + B)

where A is the number of affected offspring of heterozygous parents who inherited the H allele from the heterozygous parent, and B is the number of affected offspring of parents heterozygous for H which did not transmit the H allele to the affected child. Under the null hypothesis, A = B, for any given allele H, and this statistic is distributed, asymptotically, as a chi-square with 1 degree of freedom. Further, it should be done as a one-sided test, since the alternative hypothesis of interest is A > B, which indicates a positive association. However, in realistically sized datasets, it is easy enough (and more accurate) to compute the actual p-value from the Binomial distribution, and this is the way this program computes these p-values.

Of course, if there are multiple alleles at the marker locus, then you would need to try this statistic for each of the marker alleles in turn, and this multiple testing problem must be taken into account. In fact, application of a Bonferroni correction for n tests, where there are n alleles at a given locus, has been shown by simulation to give appropriate p-values in the multiple allele case. To this end, the TDTLIKE program only outputs these corrected one-sided p-values for each of the alleles. Of course, this does not take into account the problem of multiple markers, but it is something.

Additionally, there is a program constant 'min' which determines the minimum value for (A+B) above, below which a given marker allele will not be tested - this avoids a necessity for considering very rare alleles in the multiple testing corrections.

To try and make a more powerful statistic for multiallelic systems, I have adapted my likelihood-based disequilibrium model to the TDT design, allowing that one allele out of n would show an increased propensity to be transmitted and all other heterozygotes would transmit their alleles with 50% probability. Then the likelihood of the data can be formulated in terms of a value xi, where the probability that someone heterozygous H/? would transmit the H allele to its affected child with probability 0.5 + xi, such that the null hypothesis is that xi = 0. This multi-allelic test circumvents the problem of multiple testing, and is assumed to be approximately distributed asymptotically as a one-sided chi-square statistic with one degree of freedom. The corresponding p-values for this test are given as well by the program, where the test statistic is of the form:

                  max  L( xi > 0)
         2 ln   -------------------   
                     L(xi = 0)

To use this program is very simple. You need a LINKAGE format pedigree file, called PEDIN.DAT, and a corresponding parameter file called DATAIN.DAT. The first locus must also be the trait (disease) locus. The program only uses information from families where both parents are typed, and at least one is heterozygous for the locus under study (of course). The output is written to the screen and also to a file called TDT.OUT, which gives the TDT statistic for each allele corrected for multiple testing, and the overall multiallelic TDT statistic and corresponding p-value.

If you are testing the null hypothesis of no linkage disequilibrium, it is imperative that you have only singleton affected cases, not multiplex pedigrees. In this context, each marker locus statistic would be independent relative to the null hypothesis, though this is not the case when multiplex pedigrees are used as in the TDT design of Spielman et al (1993). A statistic with the same form has been referred to as a McNemar test, when association is being tested on singleton affecteds. For the McNemar there is a simple extension to multiple loci.

This statistic has been extended to multiple loci as well using similar arguments to those for the DISMULT program (as explained in the Am J Hum Genet paper below). The basic idea is that you assume the disease locus to be at a given point on the map of markers, and at that point you have a value for the strength of presumed association, alpha, and you determine that the association decays such that for a point at distance theta away, lambda = alpha(1 - theta)n, where lambda is the strength of the association as in the normal disequilibrium statistic outlined in the manuscript. So, the multipoint McNemar consists of the sum of the single locus McNemar LRT statistics, where each locus has its lambda determined as a function of the map distance between that marker and the presumed disease location, the horizontal decay parameter, n, which can be loosely interpreted as the age of the mutation, and alpha, the proportion of disease alleles estimated to be IBD from a common founder (carrying the associated alleles at each locus). This extension is also defined in Terwilliger (1995b).

If there are problems or if you detect any bugs, contact me at joe@well.ox.ac.uk


Also, in any publications resulting from the use of this program, please cite the following papers

Terwilliger, JD (1995a) "A Powerful Likelihood Method for the Analysis of Linkage Disequilibrium between Trait Loci and One or More Polymorphic Marker Loci" American Journal of Human Genetics 56:777-787.

Spielman RS, McGinnis RE, Ewens WJ (1993) "Transmission test for linkage disequilibrium: the insulin gene region and insulin dependent diabetes mellitus (IDDM)" Am J Hum Genet 52:506-516

Terwilliger, JD (1995b) "Pedigree-Based Likelihood Methods for Analysis of Linkage and Linkage Disequilibrium for One or More Polymorphic Marker Loci" (Under Review)