UPDATE ANNOUNCEMENT FOR LINKAGE DISEQUILIBRIUM ANALYSIS PROGRAMS
A new version of the programs DISLAMB and DISMULT for likelihood based linkage disequilibrium analysis is now available via ftp.
NEW PROGRAM VERSION 2-22-96 Updates include
If there are questions, refer them to
Joseph D. Terwilliger
FAX +44 - 1865 - 742 196
Wellcome Trust Center for Human Genetics
Windmill Road
Headington OX3 7BN
Oxford, UK
Ph. +44 - 1865 - 740 016
EMAIL joe.terwilliger@well.ox.ac.uk
OR
Joseph D. Terwilliger
FAX +1 - 212 - 568-2750
Columbia University, Unit 58
722 West 168th Street
New York, NY 10032
USA
Ph. +1 - 212 - 960-5863
EMAIL jdt3@columbia.edu
To get the programs and documentation for DEC and SUN machines under either UNIX or VMS, just do the following:
From OXFORD:
From NEW YORK
Then, on your own machine
uncompress FILENAME.tar.Z
tar xvf FILENAME.tar
There are a number of files available, as follows:
SYSTEM_diseq.tar.Z = source, executables, and documentation
SYSTEM_diseq_source.tar.Z = source and documentation
Files are available for the following SYSTEMs - OSF,SOLARIS, SUNOS, VMS. The SUNOS_diseq_source.tar.Z contains a fairly standard UNIX version of the code which could be recompiled with minimal or no changes (involving file handling if anything for some systems...) on your different UNIX system. If you download only the source files, then they will be in a directory SYSTEM_diseq/source. You shouls first create a directory SYSTEM_diseq/exec, and then to compile the programs, go to the SYSTEM_diseq/source directory and run the included shell script "compilethem" which will compile the programs and put the executables in the directory SYSTEM_diseq/exec which you just created.
There are seven programs included
DISLAMB and DISMULT are discribed in the documentation file diseq.doc, and TDTLIKE and MCNMULT are described in the file tdtlike.doc All these programs are based on algorithms largely presented in the following papers which should be cited when using them: HRRMULT and HRRLAMB are the same algorithm as DISLAMB and DISMULT, but the sampling of cases and control is done by HRRPREP according to the method outlined in Terwilliger and Ott, 1992.
FOR ALL OF THESE PROGRAMS CITE:
Terwilliger, JD (1995) " A Powerful Likelihood Method for the Analysis of Linkage
Disequilibrium Between Trait Loci and One or More Polymorphic Marker Loci" Am J Hum
Genet 56:777-787.
Terwilliger, JD (1996) "Response to Sham et al." Am J Hum Genet (In Press - May 1996).
FOR TDTLIKE or MCNMULT, quote the above articles plus
Terwilliger, JD (1996) " Pedigree-based Likelihood Methods for Analysis
of Linkage and Linkage Disequilibrium with One or More Marker Loci"
(Submitted)
FOR HRRPREP, HRRMULT, and HRRLAMB quote the first two articles plus
Falk CT, and Rubinstein P (1987) "Haplotype relative risks: an easy reliable way to
construct a proper control sample for risk calculations" Ann Human Genet 51:227-233.
Terwilliger JD and Ott J (1992) "A haplotype-based haplotype relative risk statistic"
Human Heredity 42:337-346.
Likelihood - Based Disequilibrium Analysis Programs (2/21/96)
UPDATE - 21 February 1996:
Support interval calculations added for lambda and theta in dismult. Maximization routines improved in DISMULT and DISLAMB Haplotype Relative Risk extensions included to read LINKAGE files.
I wish to acknowledge some assistance in detecting a bug in the maximization routine discovered by Michael Knapp, and some programming advice from Alejandro Shaffer.
The algorithms used in these programs are described in detail
in the following paper (which should be cited whenever these
programs are used):
Terwilliger, JD "A Powerful Likelihood Method for the Analysis of
Linkage Disequilibrium Between Trait Loci and One or More
Polymorphic Marker Loci" American Journal of Human Genetics
56:777-787 (1995).
Terwilliger, JD "Response to Sham et al." Am J Hum Genet (1996)
(In Press).
In this document I will only discuss technical issues regarding their use. Versions are included for either DEC machines or SUN machines. Where other UNIX versions differ, you may need to modify some of the file handling routines on your own. Compiled code is provided with statically linked libraries in case you have no Pascal compiler on your system for OSF, SUNOS, and SOLARIS.
In SUN Pascal, you should compile as follows:
pc -L -O4 -o dislamb dislamb.p
pc -L -O4 -o dismult dismult.p
and for DEC Pascal versions, under OSF it is recommended to compile with
pc -O4 -C all -o dislamb dislamb.p
pc -O4 -C all -o dismult dismult.p
And under VMS, just compile and link with
pas dislamb/nocheck
link dislamb/notraceback
pas dismult/nocheck
link dismult/notraceback
Then everything should work. If you still have trouble, contact me by sending EMAIL to joe@well.ox.ac.uk
DISLAMB is to be used for single locus association analysis. Just enter the data interactively - if the instructions are confusing let me know and I will try to make it more clear.
Consider the following dataset, using an assumption that the disease allele frequency is 0.001 (so that case sample provides little information for estimation of population allele frequencies), and using the nonparametric model (this is in general recommended...). This locus has 13 alleles, as follows:
CASE 10 48 12 1 3 17 2 5 1 2 3 4 1
CONTROL 8 8 10 3 3 15 2 7 3 4 2 5 2
Since the result was significant, I allowed it to estimate the recombination fraction between marker and disease under different assumptions about alpha and n - note that these estimates are very much dependent on your assumptions, and that your population is in equilibrium, so do not place to much faith in these estimates of map distance, as mentioned in the paper - support intervals are given (normally I would use the same stringency as the test in constructing support intervals for lambda - typically 3-lod-units. It is important to note that the support intervals for theta given do not take into account the userinput age of mutation in this population. To construct a support interval allowing for this, you must separately compute support intervals for each combination of age and heterogeneity parameter you wish to admit.
*************************************************************
* *
* Program DISLAMB - Version 2.1 (2/3/96) *
* *
* AJHG 56:777-787 (1995) *
* *
*************************************************************
Disease allele frequency = 0.00100000
=========================================================================================
CASE | 10. | 48. | 12. | 1. | 3. | 17. | 2. | 5. | 1. | 2. | 3. | 4. | 1. |
CONTROL | 8. | 8. | 10. | 3. | 3. | 15. | 2. | 7. | 3. | 4. | 2. | 5. | 2. |
=========================================================================================
Estimated parameters for likelihood ratio test:
Allele frequencies:
Allele H0: H1
1 0.09944751 0.12594046
2 0.30939227 0.12355539
3 0.12154696 0.15425412
4 0.02209945 0.02805155
5 0.03314917 0.04207248
6 0.17679558 0.22436222
7 0.02209945 0.02804928
8 0.06629834 0.08413984
9 0.02209945 0.02804972
10 0.03314917 0.04207317
11 0.02762431 0.03530303
12 0.04972376 0.06311307
13 0.01657459 0.02103565
Lambda 0.00000000 0.361496
LRT Chi-Square = 19.87098 p-value = 0.000004200154388 Lambda = 0.361496
SIGNIFICANT EVIDENCE OF LINKAGE DISEQUILIBRIUM BY LRT TEST
2 x n table Chi-square = 26.27953 P-value = 0.009797348469226
NO SIGNIFICANT EVIDENCE OF LINKAGE DISEQUILIBRIUM BY 2 x N TABLE CHI-SQUARE TEST
3-lod-unit support interval for lambda-hat : (0.09,0.58)
Alpha = 0.70 N = 100. Theta-hat = 0.006587 Support Interval = (0.001853,0.020142)
In this output file, you first get the results for the likelihood-based association test, which in this case is significant, with a p-value of 0.0000042 (significant because p < 0.0001). The estimate of lambda is 0.3615, which is evident of a not very strong association, even though the p-value is quite low. This is an important distinction, since the p-value tells you whether or not there is an association, while lambda serves as a measure of how strong the association is! The program also performd a more standard chi-square test of independence on the 2 x 13 table one could use to compare case and control allele frequency distributions. Here, that is not significant, having a p-value of only 0.0098 (it is assumed here that you have not proven the existence of a gene through linkage, which normally requires a p-value of 0.0001 (= Lod Score of 3), so you should have the same stringency in your association test - if not more, since there is more potential to have a multiple testing problem with association studies...). Afterwards, you can see how the value of lambda = 0.3615 can be translated into recombination fraction under certain assumptions about the age of this mutation in this population, and its degree of homogeneity.
DISMULT does the multipoint analysis of association. The algorithms are explained in the aforementioned paper, and to use it requires an input file called diseqin.dat, which is described below. Again it is not recommended to use the genotype data, but rather allele data, since it is more robust and straightforward. If you have detailed questions, send me a message.
Program DISMULT does an association analysis for multiple loci jointly, assuming a fixed map of the marker loci in either cM or kb (with some assumption about the relationship between cM and kb, since disequilibrium is a function of genetic, not physical distance.
The input file should be called DISEQIN.DAT
NUMBER OF ALLELES AT MARKER CASE(1) CONTROL(1) CASE(2) CONTROL(2) CASE(3) CONTROL(3) ...
Example DISEQIN.DAT
0.022 Disease allele frequency 3 3 Markers 1 cM/Mb 500 120 Physical Distance between Markers in Kb 5 5 Points per Interval 20 Minimum n 800 Maximum n 2 Number of alleles at locus 1 48 28 Case(1) Control(1) 58 59 3 Locus 2 70 10 6 44 10 40 5 Locus 3 73 53 69 35 1 10 2 14 3 31
Output file DISEQ.OUT from the above problem:
To make a graph just plot the Map position against the "Lod Score" or -2ln(LR) which is interpreted conservatively as 0.5* Chi-square(1df). Quantities like alpha and n are the vertical and horizontal control parameters described in the paper, and the interval and position are given so that you can see where each marker is (i.e. everytime you have position 0, you have a new marker locus, the number of which is given in the Interval column).
*************************************************************
* Program DISMULT - Version 2.1 (2/5/96) *
* AJHG 56:777-787 (1995) *
* Multilocus Linkage Disequilibrium Analysis *
*************************************************************
LOCUS 1
==============|=====|=====|
ALLELE: | 1 | 2 |
--------------|-----|-----|
CASE: | 48 | 58 |
CONTROL: | 28 | 59 |
==============|=====|=====|
LOCUS 2
==============|=====|=====|=====|
ALLELE: | 1 | 2 | 3 |
--------------|-----|-----|-----|
CASE: | 70 | 6 | 10 |
CONTROL: | 10 | 44 | 40 |
==============|=====|=====|=====|
LOCUS 3
==============|=====|=====|=====|=====|=====|
ALLELE: | 1 | 2 | 3 | 4 | 5 |
--------------|-----|-----|-----|-----|-----|
CASE: | 73 | 69 | 1 | 2 | 3 |
CONTROL: | 53 | 35 | 10 | 14 | 31 |
==============|=====|=====|=====|=====|=====|
Alpha N Inter. Position Map Position Lod Score -2ln(LR)
1.000000 390. 0 0 -4.25866 0.00000 0.00000
1.000000 20. 0 1 -0.10034 6.07393 27.97147
1.000000 20. 0 2 -0.08935 7.45386 34.32631
1.000000 20. 0 3 -0.07859 8.97969 41.35302
1.000000 20. 0 4 -0.06807 10.57762 48.71173
1.000000 20. 0 5 -0.05776 12.08476 55.65239
1.000000 20. 0 6 -0.04766 13.16856 60.64346
0.916548 20. 0 7 -0.03775 13.36366 61.54191
0.760124 20. 0 8 -0.02804 13.35679 61.51031
0.631592 20. 0 9 -0.01852 13.35003 61.47918
0.525249 20. 0 10 -0.00917 13.34339 61.44857
0.437886 20. 1 0 0.00000 13.33684 61.41844
0.439703 20. 1 1 0.00100 13.65798 62.89733
0.445691 28. 1 2 0.00199 13.95389 64.26002
0.999995 314. 1 3 0.00299 18.45487 84.98783
0.999998 426. 1 4 0.00400 22.16551 102.07595
0.782908 800. 1 5 0.00500 23.83323 109.75606
0.782976 800. 2 0 0.00500 23.83322 109.75605
0.883834 800. 2 1 0.00524 22.65050 104.30941
0.959204 800. 2 2 0.00548 18.32883 84.40738
0.521664 142. 2 3 0.00572 15.80475 72.78358
0.485274 89. 2 4 0.00596 15.11418 69.60335
0.452230 59. 2 5 0.00620 14.68567 67.63001
0.452230 59. 3 0 0.00620 14.68567 67.63001
0.821114 62. 3 1 0.01537 14.69946 67.69351
0.999994 44. 3 2 0.02472 14.63657 67.40389
0.999995 30. 3 3 0.03424 14.50510 66.79845
0.999998 23. 3 4 0.04395 14.40457 66.33549
1.000000 20. 3 5 0.05386 14.17518 65.27912
1.000000 20. 3 6 0.06396 13.02926 60.00196
1.000000 20. 3 7 0.07427 11.41072 52.54832
1.000000 20. 3 8 0.08479 9.69005 44.62431
1.000000 20. 3 9 0.09555 8.04766 37.06086
1.000000 20. 3 10 0.10654 6.56494 30.23266
TDTLIKE is a program which will compute the TDT statistic for linkage in family data, and its multiallelic likelihood based counterpart.
The TDT statistic looks at all parents of affected children who are heterozygous for a specific marker allele. Then it compares how often that specific marker allele is transmitted to the affected offspring from such heterozygous parents. If there is no linkage, then 50% of the time the H allele would be transmitted and 50% of the time the other allele would be transmitted. If there is linkage AND association, then the H allele would most likely be on the chromosome with the D allele, and they would be inherited together more than 50% of the time. Thus, a simple test of linkage between the two loci can be performed. Note that you MUST have BOTH association AND linkage to expect anything other than 50% transmission of allele H, if this is performed on a sample of singleton affecteds - if there are a small number of large pedigrees, a positive result does NOT mean there must be association.
The form of the statistic is very simple.
2
(A - B)
-------------
(A + B)
where A is the number of affected offspring of heterozygous parents who inherited the H allele from the heterozygous parent, and B is the number of affected offspring of parents heterozygous for H which did not transmit the H allele to the affected child. Under the null hypothesis, A = B, for any given allele H, and this statistic is distributed, asymptotically, as a chi-square with 1 degree of freedom. Further, it should be done as a one-sided test, since the alternative hypothesis of interest is A > B, which indicates a positive association. However, in realistically sized datasets, it is easy enough (and more accurate) to compute the actual p-value from the Binomial distribution, and this is the way this program computes these p-values.
Of course, if there are multiple alleles at the marker locus, then you would need to try this statistic for each of the marker alleles in turn, and this multiple testing problem must be taken into account. In fact, application of a Bonferroni correction for n tests, where there are n alleles at a given locus, has been shown by simulation to give appropriate p-values in the multiple allele case. To this end, the TDTLIKE program only outputs these corrected one-sided p-values for each of the alleles. Of course, this does not take into account the problem of multiple markers, but it is something.
Additionally, there is a program constant 'min' which determines the minimum value for (A+B) above, below which a given marker allele will not be tested - this avoids a necessity for considering very rare alleles in the multiple testing corrections.
To try and make a more powerful statistic for multiallelic systems, I have adapted my likelihood-based disequilibrium model to the TDT design, allowing that one allele out of n would show an increased propensity to be transmitted and all other heterozygotes would transmit their alleles with 50% probability. Then the likelihood of the data can be formulated in terms of a value xi, where the probability that someone heterozygous H/? would transmit the H allele to its affected child with probability 0.5 + xi, such that the null hypothesis is that xi = 0. This multi-allelic test circumvents the problem of multiple testing, and is assumed to be approximately distributed asymptotically as a one-sided chi-square statistic with one degree of freedom. The corresponding p-values for this test are given as well by the program, where the test statistic is of the form:
max L( xi > 0)
2 ln -------------------
L(xi = 0)
To use this program is very simple. You need a LINKAGE format pedigree file, called PEDIN.DAT, and a corresponding parameter file called DATAIN.DAT. The first locus must also be the trait (disease) locus. The program only uses information from families where both parents are typed, and at least one is heterozygous for the locus under study (of course). The output is written to the screen and also to a file called TDT.OUT, which gives the TDT statistic for each allele corrected for multiple testing, and the overall multiallelic TDT statistic and corresponding p-value.
If you are testing the null hypothesis of no linkage disequilibrium, it is imperative that you have only singleton affected cases, not multiplex pedigrees. In this context, each marker locus statistic would be independent relative to the null hypothesis, though this is not the case when multiplex pedigrees are used as in the TDT design of Spielman et al (1993). A statistic with the same form has been referred to as a McNemar test, when association is being tested on singleton affecteds. For the McNemar there is a simple extension to multiple loci.
This statistic has been extended to multiple loci as well using similar arguments to those for the DISMULT program (as explained in the Am J Hum Genet paper below). The basic idea is that you assume the disease locus to be at a given point on the map of markers, and at that point you have a value for the strength of presumed association, alpha, and you determine that the association decays such that for a point at distance theta away, lambda = alpha(1 - theta)n, where lambda is the strength of the association as in the normal disequilibrium statistic outlined in the manuscript. So, the multipoint McNemar consists of the sum of the single locus McNemar LRT statistics, where each locus has its lambda determined as a function of the map distance between that marker and the presumed disease location, the horizontal decay parameter, n, which can be loosely interpreted as the age of the mutation, and alpha, the proportion of disease alleles estimated to be IBD from a common founder (carrying the associated alleles at each locus). This extension is also defined in Terwilliger (1995b).
If there are problems or if you detect any bugs, contact me at joe@well.ox.ac.uk
Also, in any publications resulting from the use of this program,
please cite the following papers
Terwilliger, JD (1995a) "A Powerful Likelihood Method for the Analysis of
Linkage Disequilibrium between Trait Loci and One or More Polymorphic
Marker Loci" American Journal of Human Genetics 56:777-787.
Spielman RS, McGinnis RE, Ewens WJ (1993) "Transmission test for linkage
disequilibrium: the insulin gene region and insulin dependent diabetes
mellitus (IDDM)" Am J Hum Genet 52:506-516
Terwilliger, JD (1995b) "Pedigree-Based Likelihood Methods for Analysis of
Linkage and Linkage Disequilibrium for One or More Polymorphic Marker
Loci" (Under Review)