| LECTURE 4: PRACTICAL ISSUES (I): DISEASE GENE MUTATION IS NOT DIRECTLY OBSERVED |
(marker) <-------> (disease gene) -------> (affection status) <------- 2nd,3rd ...gene, environment
aa,aA,AA---------------d,D-------------Aff,Unaff----------------------------
Ideally, we would like to carry out the association analysis between the disease gene alleles (d or D) and the affection status. In reality, the disease gene is unknown and not directly observed, so we carry out the association analysis between the marker and the affection status instead.
(5' regulatory region) ---- (exon 1) --- ---- (exon 2) ---- ----- (3' regulatory region)
The first marker has genotypes aa, aA, AA; the second marker has genotypes bb, bB, BB (e.g. the second marker is the disease gene). Suppose we can see which allele of the first marker is on the same chromosome (paternally or maternally derived) as each allele of the second marker. We then count 2N "haplotypes" (N is the number of persons). There are four types of haplotypes: a-b, a-B, A-b, A-B. This is exactly the same situation as the 2-by-2 table we discussed earlier:
| | b | B | total | |
| a | 10 | 90 | 100 | pa= 0.1 |
| A | 400 | 500 | 900 | pA= 0.9 |
| total | 410 | 590 | 1000 | |
| | pb=0.41 | pB=0.59 | | |
pa-b= 10/1000=0.01, lower than expected: pa pb= 0.1*0.41=0.041.
The difference is: pa-b- pa pb = -0.031. Similarly
pa-B- pa pB = 0.09- 0.1*0.59= 0.031
pA-b- pA pb = 0.4 - 0.9*0.41= 0.031
pA-B- pA pB = 0.5- 0.9*0.59= -0.031
So even though there are 4 differences, there is only 1 independent value. The magnitude 0.031 is called "D" (linkage
disequilibrium coefficient). An even simpler formula:
D= abs( pa-b pA-B - pA-b pa-B)
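The arithmetic above can be checked with a short Python sketch. The function name `ld_coefficient` and its argument order are illustrative assumptions, not part of the lecture; the counts are those of the 2-by-2 haplotype table.

```python
def ld_coefficient(n_ab, n_aB, n_Ab, n_AB):
    """Return (signed D for haplotype a-b, |D| from the cross-product formula)."""
    total = n_ab + n_aB + n_Ab + n_AB
    # Haplotype frequencies
    p_ab, p_aB, p_Ab, p_AB = (n / total for n in (n_ab, n_aB, n_Ab, n_AB))
    # Allele frequencies are the row and column margins
    p_a = p_ab + p_aB
    p_b = p_ab + p_Ab
    d_signed = p_ab - p_a * p_b                   # observed minus expected
    d_cross = abs(p_ab * p_AB - p_Ab * p_aB)      # "simpler formula"
    return d_signed, d_cross


d_signed, d_cross = ld_coefficient(10, 90, 400, 500)
print(round(d_signed, 3), round(d_cross, 3))  # -0.031 0.031
```

Both routes give the same magnitude, which is why only one of the four differences is independent.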
Most people don't use "D", but one of two normalized measures:
normalized D, D-prime, D':
D divided by the tightest bound of the D value. This formula is used:
bound(Da-b) = min(pa pB, pA pb) if Da-b > 0, and bound(Da-b) = -min(pa pb, pA pB) if Da-b < 0.
r^2: defined as
r^2 = D^2 / (pa pA pb pB)
The two measures can differ substantially, as in this example:
| | b | B | total | |
| a | 0 | 100 | 100 | pa= 0.1 |
| A | 410 | 490 | 900 | pA= 0.9 |
| total | 410 | 590 | 1000 | |
| | pb=0.41 | pB=0.59 | | |
Da-b < 0, so bound = -min(pa pb, pA pB) = -min(0.1*0.41, 0.9*0.59) = -min(0.041, 0.531) = -0.041. Since Da-b = 0 - 0.041 = -0.041, D' = 1. However, r^2 = 0.041^2/(0.1*0.9*0.41*0.59) = 0.077.
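A minimal Python sketch of both normalized measures, applied to the two haplotype tables in this lecture. The function name `d_prime_and_r2` is an assumption made for illustration, and the sketch assumes both markers are polymorphic (otherwise the denominators are zero).

```python
def d_prime_and_r2(n_ab, n_aB, n_Ab, n_AB):
    """Return (|D'|, r^2) for a 2-by-2 table of haplotype counts."""
    total = n_ab + n_aB + n_Ab + n_AB
    p_ab = n_ab / total
    p_a = (n_ab + n_aB) / total
    p_b = (n_ab + n_Ab) / total
    p_A, p_B = 1 - p_a, 1 - p_b
    d = p_ab - p_a * p_b
    # Magnitude of the tightest bound on D, chosen by the sign of D
    if d > 0:
        bound = min(p_a * p_B, p_A * p_b)
    else:
        bound = min(p_a * p_b, p_A * p_B)
    d_prime = abs(d) / bound
    r2 = d * d / (p_a * p_A * p_b * p_B)
    return d_prime, r2


dp1, r2_1 = d_prime_and_r2(10, 90, 400, 500)   # first table
dp2, r2_2 = d_prime_and_r2(0, 100, 410, 490)   # second table
print(round(dp1, 3), round(r2_1, 3))  # 0.756 0.044
print(round(dp2, 3), round(r2_2, 3))  # 1.0 0.077
```

In the second table one haplotype (a-b) is absent, which is enough to make D' = 1 even though the allele frequencies at the two markers are far from matching, so r^2 stays small.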
[FURTHER READING:
B Devlin, N Risch (1995), "A comparison of linkage disequilibrium measures for
fine-scale mapping", Genomics, 29:311-322.
SW Guo (1997),
"Linkage disequilibrium measures for fine-scale mapping: a comparison",
Human Heredity, 47(6):301-314.
]
Markers with more than two alleles:
D' = sum_ij pij |D'ij|
For two-allele markers, all four |D'ij| values are the same, so
D' = |D'a-b| sum_ij pij
= |D'a-b|
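The weighted sum can be sketched in Python as follows. The weights are the haplotype frequencies pij, as the formula is written in these notes; the function name `multiallelic_d_prime` and the nested-list input layout are assumptions for illustration. Each D'ij is computed by treating allele i and allele j against all other alleles.

```python
def multiallelic_d_prime(hap_counts):
    """hap_counts[i][j]: number of haplotypes carrying allele i at the
    first marker and allele j at the second marker."""
    total = sum(sum(row) for row in hap_counts)
    freqs = [[n / total for n in row] for row in hap_counts]
    p = [sum(row) for row in freqs]            # allele frequencies, marker 1
    q = [sum(col) for col in zip(*freqs)]      # allele frequencies, marker 2
    result = 0.0
    for i, row in enumerate(freqs):
        for j, p_ij in enumerate(row):
            d = p_ij - p[i] * q[j]
            # Bound magnitude for allele i vs. not-i, allele j vs. not-j
            if d > 0:
                bound = min(p[i] * (1 - q[j]), (1 - p[i]) * q[j])
            else:
                bound = min(p[i] * q[j], (1 - p[i]) * (1 - q[j]))
            if bound > 0:
                result += p_ij * abs(d) / bound
    return result


two_allele = multiallelic_d_prime([[10, 90], [400, 500]])
print(round(two_allele, 3))  # 0.756
```

For a two-allele table the result reduces to |D'a-b|, because all four terms share the same |D'ij| and the weights sum to 1.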
In real situations, the phase (parental origin) of an allele is unknown. The raw data look like this:
| | bb | bB | BB |
| aa | 0 | 0 | 14 |
| aA | 0 | 4 | 34 |
| AA | 10 | 50 | 109 |
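Only the 4 double heterozygotes (aA, bB) are ambiguous: each carries either the haplotype pair a-b and A-B (situation 1) or the pair a-B and A-b (situation 2). Assume that situation (1) happens N1 times and situation (2) happens N2 times (with N1 + N2 = 4); then the haplotype counts are: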
| | b | B |
| a | N1 | 14*2 + 34 + N2 = 62 + N2 |
| A | 10*2 + 50 + N2 = 70 + N2 | 34 + 50 + 109*2 + N1 = 302 + N1 |
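The counting rule behind this table can be written as a small Python sketch. The names `GENOTYPES` and `haplotype_counts` are illustrative assumptions; the data are the 3-by-3 genotype counts given above (rows aa, aA, AA; columns bb, bB, BB).

```python
# Genotype counts: rows aa, aA, AA; columns bb, bB, BB
GENOTYPES = [
    [0, 0, 14],
    [0, 4, 34],
    [10, 50, 109],
]


def haplotype_counts(geno, n1):
    """Haplotype counts (a-b, a-B, A-b, A-B) when n1 of the double
    heterozygotes carry a-b/A-B and the remaining ones carry a-B/A-b."""
    (aa_bb, aa_bB, aa_BB), (aA_bb, aA_bB, aA_BB), (AA_bb, AA_bB, AA_BB) = geno
    n2 = aA_bB - n1
    # Double homozygotes contribute two identical haplotypes; single
    # heterozygotes contribute one of each of two haplotypes.
    n_ab = 2 * aa_bb + aa_bB + aA_bb + n1
    n_aB = 2 * aa_BB + aa_bB + aA_BB + n2
    n_Ab = 2 * AA_bb + AA_bB + aA_bb + n2
    n_AB = 2 * AA_BB + AA_bB + aA_BB + n1
    return n_ab, n_aB, n_Ab, n_AB


print(haplotype_counts(GENOTYPES, 2))  # (2, 64, 72, 304)
```

Whatever the split, the four counts always add up to 2N = 442 haplotypes for the 221 persons in the table.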
Under Hardy-Weinberg equilibrium, the expected ratio N1/N2 equals (pa-b pA-B)/(pa-B pA-b), so we can iterate. First assume N1/N2 = 1 (N1=2, N2=2), which gives an updated ratio N1/N2 = (pa-b pA-B)/(pa-B pA-b) = (2*304)/(64*72) = 0.13. So next we assume N1=1, N2=3, which gives N1/N2 = (1*303)/(65*73) = 0.06. Continuing the iteration, we assume N1=0, N2=4, which gives N1/N2 = (0*302)/(66*74) = 0. No further improvement can be made, and the iteration stops.
What is described above is the so-called "E-M" (expectation-maximization) algorithm for maximizing the likelihood.
Here is the iteration in the EM algorithm:
| N1 | pa-b | pa-B | pA-b | pA-B | solving N1 |
| 2 (half-half chance) | 0.004524887 | 0.144796380 | 0.162895928 | 0.687782805 | 0.4662575 -> 0 |
| 0 | 0 | 0.1493213 | 0.1674208 | 0.6832579 | 0 (no change in N1, stop) |
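The iteration in the table can be sketched in Python. This is a minimal sketch under stated assumptions: the names `em_step`, `run_em` and `FIXED` are hypothetical, Hardy-Weinberg equilibrium is assumed, and N1 is kept fractional between steps rather than rounded to an integer as in the table, so it takes more steps to reach the same limit.

```python
def em_step(n1, fixed, n_ambiguous):
    """One EM update of N1, the expected number of a-b/A-B double heterozygotes.

    fixed = (a-b, a-B, A-b, A-B) haplotype counts from unambiguous genotypes.
    """
    f_ab, f_aB, f_Ab, f_AB = fixed
    n2 = n_ambiguous - n1
    # M-step: haplotype counts implied by the current split
    c_ab, c_aB = f_ab + n1, f_aB + n2
    c_Ab, c_AB = f_Ab + n2, f_AB + n1
    # E-step: the two phases are weighted by pa-b*pA-B and pa-B*pA-b
    # (the common factor 1/(2N)^2 cancels in the ratio)
    w1 = c_ab * c_AB
    w2 = c_aB * c_Ab
    return n_ambiguous * w1 / (w1 + w2)


def run_em(fixed, n_ambiguous, tol=1e-10, max_iter=1000):
    n1 = n_ambiguous / 2          # start from a half-half split
    for _ in range(max_iter):
        new_n1 = em_step(n1, fixed, n_ambiguous)
        converged = abs(new_n1 - n1) < tol
        n1 = new_n1
        if converged:
            break
    n2 = n_ambiguous - n1
    total = sum(fixed) + 2 * n_ambiguous
    counts = (fixed[0] + n1, fixed[1] + n2, fixed[2] + n2, fixed[3] + n1)
    return n1, tuple(c / total for c in counts)


FIXED = (0, 62, 70, 302)          # unambiguous haplotype counts from the data
first = em_step(2.0, FIXED, 4)
n1_final, freqs = run_em(FIXED, 4)
print(round(first, 4))                 # 0.4663
print([round(f, 4) for f in freqs])    # [0.0, 0.1493, 0.1674, 0.6833]
```

The first update reproduces the "solving N1" value in the first row of the table, and the final frequencies agree with its second row.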
[COMMENTS:
1. Hardy-Weinberg equilibrium is assumed.
2. It is not necessary to know the distance between the two
marker locations (other unambiguous situations tell the story).
3. This new paper: Mano et al. (2004), "Notes on the maximum
likelihood estimation of haplotype frequencies",
Annals of Human Genetics, 68:257-264, suggests that if
the difference between the "coupling" and "repulsion" haplotype
frequencies in phase-known individuals is smaller than
1.5 times the frequency of phase-unknown individuals,
one should be careful with the result.
]
Programs that use EM algorithm:
AG Clark (1990), "Inference of haplotypes from PCR-amplified samples of diploid populations", Molecular Biology and Evolution, 7:111-122.
In general, whenever pedigree data is dealt with instead of population data, we are doing "linkage analysis" instead of "association analysis". These linkage analysis programs can be used for haplotype reconstruction:
MERLIN: http://www.sph.umich.edu/csg/abecasis/Merlin/tour/haplotyping.html
GENEHUNTER: http://www.nslij-genetics.org/soft/gh/hap.html
SIMWALK : http://www.genetics.ucla.edu/software/simwalk_doc/
However, when parents are not typed, the reconstruction result may not be reliable: W Li, PG Gregersen (2004), "Reconstructing haplotypes in pedigrees: the importance of parental information", American Journal of Medical Genetics, 124A:107-109.
| chromosome | physical map (Mb) | sex-averaged genetic map (cM) | male genetic map (cM) | female genetic map (cM) |
| ch1 | 245.1 | 286.5 | 221.7 | 358.0 |
| ch2 | 243.2 | 263.3 | 191.6 | 338.6 |
| ch3 | 199.0 | 225.1 | 170.2 | 282.3 |
| ch4 | 191.5 | 212.2 | 154.7 | 273.2 |
| ch5 | 180.3 | 208.2 | 155.5 | 264.9 |
| ch6 | 170.0 | 192.2 | 140.9 | 247.6 |
| ch7 | 158.1 | 189.0 | 142.2 | 237.8 |
| ch8 | 145.7 | 173.3 | 132.3 | 216.5 |
| ch9 | 135.8 | 168.7 | 141.1 | 198.4 |
| ch10 | 134.6 | 173.5 | 133.2 | 216.5 |
| ch11 | 134.1 | 163.8 | 124.3 | 205.3 |
| ch12 | 131.4 | 174.2 | 137.2 | 213.8 |
| ch13 | 94.5 | 128.9 | 102.5 | 157.0 |
| ch14 | 85.1 | 123.8 | 101.8 | 146.7 |
| ch15 | 79.7 | 130.2 | 106.5 | 157.1 |
| ch16 | 89.5 | 134.2 | 110.9 | 159.5 |
| ch17 | 81.2 | 137.5 | 113.0 | 164.6 |
| ch18 | 75.3 | 102.3 | 124.2 | 148.3 |
| ch19 | 62.8 | 112.2 | 99.3 | 126.3 |
| ch20 | 81.4 | 102.5 | 82.1 | 125.3 |
| ch21 | 32.2 | 68.5 | 58.4 | 80.5 |
| ch22 | 33.1 | 86.1 | 79.4 | 93.4 |
The principle is that the allele/haplotype frequency should match the mutation frequency.
Example:
The haplotype A-B-C also matches the mutation perfectly.
On the other hand, haplotype B-C is not in perfect linkage disequilibrium with the mutation. This can be checked from the haplotype frequency: P(B-C)=70%.
In general, because the number of haplotypes is larger than the number of single-marker alleles, it is more likely that some haplotype frequency closely matches the frequency of the disease locus mutation allele. As a result, haplotype-based association analysis has a higher potential to detect the mutation than single-marker analysis.
| Summary: From Affection Status to Disease Gene to Marker |
See the paper: KT Zondervan, LR Cardon (2004), "The complex interplay among factors that influence allelic association", Nature Reviews Genetics, 5:89-100.
They proposed four things to consider for an association analysis. Three of them are related to this lecture: the frequency of the disease allele, the frequency of the marker allele, and the extent of linkage disequilibrium between the marker and the disease locus.
These factors are important because we do not see the disease gene directly! The bridge between the disease gene and the marker is crucial for successfully establishing a connection between the disease gene and the affection status.