LECTURE 4: PRACTICAL ISSUES (I): DISEASE GENE MUTATION IS NOT DIRECTLY OBSERVED

The Basic Picture

(marker) <-------> (disease gene) -------> (affection status) <------- 2nd,3rd ...gene, environment
aa,aA,AA---------------d,D-------------Aff,Unaff----------------------------

Ideally, we would like to carry out the association analysis between the disease gene alleles (d or D) and the affection status. In reality, the disease gene is unknown and not directly observed, so we carry out the association analysis between the marker and the affection status instead.

The Nature of Disease Gene Mutation

(5' regulatory region) ---- (exon 1) --- ---- (exon 2) ---- ----- (3' regulatory region)

[COMMENT 1: Because there are two copies of each chromosome (maternally derived and paternally derived), the above scenarios apply to each copy of the chromosome. How the two copies combine can be summarized as a dominant model, a recessive model, or somewhere in between. ]
[COMMENT 2: If there is a homologous gene at a different chromosome location (2-member multiple-gene family, or "paralogue") due to an ancestral duplication, there are 4 copies of that gene if the second gene has the same/similar function] [for more information on duplication, see http://www.nslij-genetics.org/duplication/]

The Association Between Two Neighboring Markers (Linkage Disequilibrium). 1. Phase Known

The first marker has genotypes aa, aA, AA; the second marker has genotypes bb, bB, BB (e.g. the second marker is the disease gene). Suppose we can see which allele of the first marker is on the same chromosome (paternally derived or maternally derived) as each allele of the second marker. Then we simply count 2N "haplotypes" (N is the number of persons). There are four types of haplotypes: a-b, a-B, A-b, A-B. This is exactly the same situation as the 2-by-2 table we discussed earlier:

           b       B     total
a         10      90      100     pa = 0.1
A        400     500      900     pA = 0.9
total    410     590     1000
      pb = 0.41  pB = 0.59

pa-b= 10/1000=0.01, lower than expected: pa pb= 0.1*0.41=0.041. The difference is: pa-b- pa pb = -0.031. Similarly
pa-B- pa pB = 0.09- 0.1*0.59= 0.031
pA-b- pA pb = 0.4 - 0.9*0.41= 0.031
pA-B- pA pB = 0.5- 0.9*0.59= -0.031
So even though there are 4 differences, there is only 1 independent value. Its magnitude, 0.031, is called "D" (linkage disequilibrium coefficient). An even simpler formula: D = abs(pa-b pA-B - pA-b pa-B)
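The two expressions for D agree because the four haplotype frequencies sum to 1. A short derivation, using subscript notation (p_{ab} for the a-b haplotype frequency) that is assumed here only for readability:

```latex
\begin{aligned}
p_{ab} - p_a p_b &= p_{ab} - (p_{ab} + p_{aB})(p_{ab} + p_{Ab}) \\
                 &= p_{ab}\,(1 - p_{ab} - p_{aB} - p_{Ab}) - p_{aB}\,p_{Ab} \\
                 &= p_{ab}\,p_{AB} - p_{aB}\,p_{Ab}
\end{aligned}
```

With the table above, 0.01*0.5 - 0.09*0.4 = -0.031, the same value as pa-b - pa pb.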

Most people don't use "D" itself, but one of two normalized measures:

D' (normalized D, "D-prime"): D divided by the tightest bound on the D value. This formula is used:
D' = D/bound(D)
bound(Da-b) = min(pa pB, pA pb) if Da-b > 0, and
bound(Da-b) = -min(pa pb, pA pB) if Da-b < 0.
Here Da-b = -0.031 and bound(Da-b) = -0.041, so D' = 0.031/0.041 = 0.756

r^2: defined as
r^2 = D^2/(pa pA pb pB).
Here, r^2 = 0.031^2/(0.1*0.9*0.41*0.59) = 0.044.
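The three measures can be computed directly from the four haplotype counts. The following is a minimal sketch, not a reference implementation; the function name `ld_measures` and its argument names are made up for illustration, and it assumes both markers are polymorphic (otherwise the r^2 denominator is zero).

```python
def ld_measures(n_ab, n_aB, n_Ab, n_AB):
    """Return (D, D', r^2) from phase-known haplotype counts.

    D is the signed value for the a-b haplotype; D' is reported as a
    non-negative number. Hypothetical helper for illustration only.
    """
    total = n_ab + n_aB + n_Ab + n_AB
    p_ab = n_ab / total
    p_a = (n_ab + n_aB) / total   # allele frequency of a
    p_b = (n_ab + n_Ab) / total   # allele frequency of b
    p_A, p_B = 1 - p_a, 1 - p_b

    d = p_ab - p_a * p_b          # signed D for haplotype a-b

    # Tightest bound on |D| given the allele frequencies.
    if d > 0:
        bound = min(p_a * p_B, p_A * p_b)
    else:
        bound = min(p_a * p_b, p_A * p_B)
    d_prime = abs(d) / bound if bound > 0 else 0.0

    r2 = d * d / (p_a * p_A * p_b * p_B)
    return d, d_prime, r2


# Counts from the 2-by-2 table above: a-b, a-B, A-b, A-B.
d, d_prime, r2 = ld_measures(10, 90, 400, 500)
print(round(d, 3), round(d_prime, 3), round(r2, 3))
```

For the table above this reproduces the values in the text: D = -0.031 for the a-b haplotype, D' = 0.756, and r^2 = 0.044.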
What happens when one of the haplotype counts is zero? D' is always equal to 1.

           b       B     total
a          0     100      100     pa = 0.1
A        410     490      900     pA = 0.9
total    410     590     1000
      pb = 0.41  pB = 0.59

Da-b < 0, so the bound is min(0.1*0.41, 0.9*0.59) = min(0.041, 0.531) = 0.041 in magnitude. Since |D| = 0.041, D' = 1. However, r^2 = 0.041^2/(0.1*0.9*0.41*0.59) = 0.077.
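Why D' must equal 1 whenever a haplotype count is zero can be seen in a few lines. The sketch below takes the a-b cell as the empty one (the other three cases follow by relabeling the alleles) and uses subscript notation that is assumed here for readability:

```latex
\begin{aligned}
p_{ab} = 0 \;&\Rightarrow\; p_a = p_{aB} \le p_B, \qquad p_b = p_{Ab} \le p_A
           \;\Rightarrow\; p_a p_b \le p_A p_B \\
D_{ab} &= p_{ab} - p_a p_b = -\,p_a p_b \\
|D'| &= \frac{|D_{ab}|}{\min(p_a p_b,\; p_A p_B)} = \frac{p_a p_b}{p_a p_b} = 1 \\
r^2 &= \frac{(p_a p_b)^2}{p_a\,p_A\,p_b\,p_B} = \frac{p_a p_b}{p_A p_B}
\end{aligned}
```

With the numbers above, r^2 = 0.041/0.531, about 0.077, so r^2 stays well below 1 even though D' = 1.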

[FURTHER READING: B Devlin, N Risch (1995), "A comparison of linkage disequilibrium measures for fine-scale mapping", Genomics, 29:311-322.
SW Guo (1997), "Linkage disequilibrium measures for fine-scale mapping: a comparison", Human Heredity, 47(6):301-314. ]

Markers with more than two alleles: D' = sum_{i,j} p_ij |D'_ij|
For two-allele markers, all four |D'| values are the same, so D' = |D'_a-b| sum_{i,j} p_ij = |D'_a-b|

The Association Between Two Neighboring Markers (Linkage Disequilibrium). 2. Phase Unknown

In real data, the phase (parental origin) of an allele is unknown. The raw data look like this:

       bb     bB     BB
aa      0      0     14
aA      0      4     34
AA     10     50    109

Only "double heterozygous" genotypes cause problems: aA-bB can be either (1) a-b and A-B; or (2) a-B and A-b.

Assume that situation (1) happens N1 times, and situation (2) happens N2 times (and N1+ N2=4), then

     b                           B
a    N1                          14*2 + 34 + N2 = 62 + N2
A    10*2 + 50 + N2 = 70 + N2    34 + 50 + 109*2 + N1 = 302 + N1

If we knew the value of N1 and N2, then the four haplotype frequencies are:
pa-b = N1/442
pa-B = (62+N2)/442
pA-b = (70+N2)/442
pA-B = (302+N1)/442
On the other hand, N1 is proportional to pa-b pA-B, and N2 is proportional to pa-B pA-b.

We can iterate the process. First assume N1/N2 = 1 (N1=2, N2=2), which leads to pa-b pA-B/(pa-B pA-b) = (2*304)/(64*72) = 0.13 as the updated value of N1/N2. This is well below 1, so next we assume N1=1, N2=3, which leads to pa-b pA-B/(pa-B pA-b) = (1*303)/(65*73) = 0.06. Continuing the iteration, we assume N1=0, N2=4, which leads to pa-b pA-B/(pa-B pA-b) = (0*302)/(66*74) = 0, in agreement with the assumed N1/N2 = 0. No further improvement can be made, and the iteration stops.

What is described above is the so-called "E-M" (expectation-maximization) algorithm for maximizing the likelihood.

Here is the iteration in the EM algorithm:

N1                     pa-b          pa-B          pA-b          pA-B          solving N1
2 (half-half chance)   0.004524887   0.144796380   0.162895928   0.687782805   0.4662575 -> 0
0                      0             0.1493213     0.1674208     0.6832579     0 (no change in N1, stop)
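The iteration can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the code of any particular program: the function name `em_two_locus` and the 3-by-3 list layout are made up here, Hardy-Weinberg equilibrium is assumed, and N1 is kept as a fractional expected count at every step rather than being rounded to an integer as in the hand calculation.

```python
def em_two_locus(geno, n_iter=50):
    """EM estimate of two-locus haplotype frequencies (hypothetical helper).

    geno is a 3x3 table of person counts: rows aa, aA, AA and
    columns bb, bB, BB. Returns ((p_ab, p_aB, p_Ab, p_AB), history of N1).
    """
    n_dh = geno[1][1]                       # double heterozygotes aA-bB
    total = 2 * sum(sum(row) for row in geno)

    # Haplotype counts from genotypes whose phase is unambiguous.
    base_ab = 2 * geno[0][0] + geno[0][1] + geno[1][0]
    base_aB = 2 * geno[0][2] + geno[0][1] + geno[1][2]
    base_Ab = 2 * geno[2][0] + geno[2][1] + geno[1][0]
    base_AB = 2 * geno[2][2] + geno[2][1] + geno[1][2]

    def freqs(n1):
        # Haplotype frequencies when n1 double heterozygotes are
        # a-b/A-B and the remaining n2 are a-B/A-b.
        n2 = n_dh - n1
        return ((base_ab + n1) / total, (base_aB + n2) / total,
                (base_Ab + n2) / total, (base_AB + n1) / total)

    n1 = n_dh / 2.0                         # start from a half-half split
    history = [n1]
    for _ in range(n_iter):
        p_ab, p_aB, p_Ab, p_AB = freqs(n1)
        coupling = p_ab * p_AB              # weight of phase a-b / A-B
        repulsion = p_aB * p_Ab             # weight of phase a-B / A-b
        if coupling + repulsion == 0:
            break
        # Expected number of double heterozygotes in the a-b / A-B phase.
        n1 = n_dh * coupling / (coupling + repulsion)
        history.append(n1)
    return freqs(n1), history


geno = [[0, 0, 14],
        [0, 4, 34],
        [10, 50, 109]]
freqs_final, history = em_two_locus(geno)
print(history[:3])
print(freqs_final)
```

On the data above, the first update moves N1 from 2 to about 0.466, matching the "solving N1" column, and later updates shrink N1 toward 0, so the estimated frequencies approach 0, 66/442, 74/442 and 302/442.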

[COMMENTS: 1. Hardy-Weinberg equilibrium is assumed.
2. It is not necessary to know the distance between the two marker locations (other unambiguous situations tell the story).
3. This new paper: Mano et al. (2004), "Notes on the maximum likelihood estimation of haplotype frequencies", Annals of Human Genetics, 68:257-264, suggests that if the difference between the "coupling" and "repulsive" haplotype frequencies in phase-known individuals is smaller than 1.5 times the frequency of phase-unknown individuals, one should be careful with the result. ]

Programs that use EM algorithm:

Reconstruct Haplotypes for a Group of Individuals with Many SNPs

AG Clark (1990), "Inference of haplotypes from PCR-amplified samples of diploid populations", Molecular Biology and Evolution, 7:111-122.

M Stephens, NJ Smith, P Donnelly (2001), "A new statistical method for haplotype reconstruction from population data", American Journal of Human Genetics, 68:978-989.

More approaches are described in http://www.meb.ki.se/genestat/tl/genass_ldmap/measuring_ld/estimation_haplo/haplotype_estimation.htm

Reconstruct Haplotypes When the Relatives' Genotypes Are Available

In general, whenever pedigree data is dealt with instead of population data, we are doing "linkage analysis" instead of "association analysis". These linkage analysis programs can be used for haplotype reconstruction:

MERLIN: http://www.sph.umich.edu/csg/abecasis/Merlin/tour/haplotyping.html

GENEHUNTER: http://www.nslij-genetics.org/soft/gh/hap.html

SIMWALK : http://www.genetics.ucla.edu/software/simwalk_doc/

However, when parents are not typed, the reconstruction result may not be reliable: W Li, PG Gregersen (2004), "Reconstructing haplotypes in pedigrees: the importance of parental information", American Journal of Medical Genetics, 124A:107-109.

Haplotype Blocks in Human Genome

In the Hinds et al. (2005) paper, the haplotype block size is ~8.8 kb in African-Americans, ~20.7 kb in European-Americans, and ~25.2 kb in Han Chinese.

Using Haplotype Or Allele to Detect Association?

The principle is that the allele/haplotype frequency should match the mutation frequency.

Example:

a-(d)-b-c 10%
a-(d)-b-C 20%
a-(d)-B-C 40%
A-(D)-B-C 30%
For the first SNP, P(A)=30%, matching the mutation allele frequency.
For the second SNP, P(B)=70%, P(b)=30%: the frequency of allele b matches, but allele b is not on the mutation-carrying haplotype (D is always carried with B).
For the third SNP, P(C)=90%, P(c)=10%, the matching is the worst.

The haplotype A-B-C also matches the mutation perfectly.

On the other hand, haplotype B-C does not have a perfect linkage disequilibrium with the mutation. It can be checked by the haplotype frequency: P(B-C)=70%.

In general, because the number of haplotypes is larger than the number of single-marker alleles, it is more likely that some haplotype frequency closely matches the frequency of the disease locus mutation allele. As a result, haplotype-based association analysis has a higher potential to detect the mutation than single-marker analysis.
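The example can be checked numerically by computing, for each allele or haplotype, its frequency and its r^2 with the mutation allele D. This is a small sketch for illustration; the data layout and the helper name `r2_with_mutation` are assumptions made here, not part of any package.

```python
# The four haplotypes of the example, written as
# (SNP 1, mutation site, SNP 2, SNP 3) with their frequencies.
haplotypes = [
    (("a", "d", "b", "c"), 0.10),
    (("a", "d", "b", "C"), 0.20),
    (("a", "d", "B", "C"), 0.40),
    (("A", "D", "B", "C"), 0.30),
]


def r2_with_mutation(carries):
    """Return (frequency, r^2 with D) for a marker pattern.

    `carries` is a function that says whether a haplotype tuple
    carries the allele or sub-haplotype of interest.
    """
    p_m = sum(f for h, f in haplotypes if carries(h))
    p_d = sum(f for h, f in haplotypes if h[1] == "D")
    p_md = sum(f for h, f in haplotypes if carries(h) and h[1] == "D")
    d = p_md - p_m * p_d
    return p_m, d * d / (p_m * (1 - p_m) * p_d * (1 - p_d))


patterns = {
    "A": lambda h: h[0] == "A",
    "b": lambda h: h[2] == "b",
    "c": lambda h: h[3] == "c",
    "A-B-C": lambda h: (h[0], h[2], h[3]) == ("A", "B", "C"),
    "B-C": lambda h: (h[2], h[3]) == ("B", "C"),
}

for name, carries in patterns.items():
    freq, r2 = r2_with_mutation(carries)
    print(f"{name}: frequency={freq:.2f}, r^2={r2:.3f}")
```

Allele A and haplotype A-B-C give r^2 = 1. Allele b has the same 30% frequency as the mutation but r^2 of only about 0.18, because it never shares a haplotype with D. Haplotype B-C also gives about 0.18, and allele c about 0.05.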

Summary: From Affection Status to Disease Gene to Marker

See the paper: KT Zondervan, LR Cardon (2004), "The complex interplay among factors that influence allelic association", Nature Reviews Genetics, 5:89-100.

They proposed four things to consider for an association analysis. Three of them are related to this lecture: the frequency of the disease allele, the frequency of the marker allele, and the extent of linkage disequilibrium between the marker and the disease locus.

These factors are important because we do not see the disease gene directly! The bridge between the disease gene and the marker is crucial to successfully establishing a connection between the disease gene and the affection status.