LECTURE 5: PRACTICAL ISSUES (II): HETEROGENEITY

The Concept of Population

In population genetics, a population is usually related to random mating. Random mating leads to homogeneity. Homogeneity makes it possible to define genotype and allele frequencies.

In Sewall Wright's four-volume book, Evolution and the Genetics of Populations, Vol. 2 (1969) is titled "The Theory of Gene Frequencies".

Different Types of Heterogeneity in Association Analysis

Inhomogeneity (heterogeneity) is one of the two major difficulties in association analysis.

Various types of heterogeneity in the case group:

  1. Locus heterogeneity: genes at more than one chromosomal location are related to the disease.
    This is a problem for both linkage analysis and association analysis.
  2. Allelic heterogeneity: even if only one gene is related to the disease, different mutations at different locations within the gene can damage it.
    This is not a problem for linkage analysis, which tracks the locus within families regardless of which mutation is segregating, but it is a problem for association analysis, which relies on the same allele being shared across unrelated cases.
See: JD Terwilliger, KM Weiss (1998), "Linkage disequilibrium mapping and complex disease: fantasy or reality?", Current Opinion in Biotechnology, 9(6):578-594.
JK Pritchard, NJ Cox (2002), "The allelic architecture of human disease genes: common disease-common variant... or not?", Human Molecular Genetics, 11:2417-2423.

Various types of heterogeneity in control group:

What Happens to the Association Analysis When There Is Heterogeneity? Simpson's Paradox

See: http://en.wikipedia.org/wiki/Simpson's_paradox

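Consider the following two subpopulations (entries are allele counts; OR is the odds of allele "a" versus "A" in cases divided by the same odds in controls):

Subpopulation 1:
            a     A
case        60    40
control     9     1
OR = 0.167, 95% CI = (0.02, 1.367)

Subpopulation 2:
            a     A
case        1     9
control     30    70
OR = 0.259, 95% CI = (0.03, 2.138)

In both subpopulations the OR is smaller than 1, so allele "A" is enriched in the case group (although, with such small counts, both confidence intervals include 1). However, when we combine the two subpopulations to form one population:

            a     A
case        61    49
control     39    71
OR = 2.266, 95% CI = (1.317, 3.898)

the direction reverses: allele "a" now appears enriched in the case group, and the confidence interval excludes 1.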

In general, if severe heterogeneity is not removed, an association analysis result cannot be trusted.

The heterogeneity in this example is very easy to detect: the control-group frequency of allele "a", p_a, drops from 90% in population 1 to 30% in population 2. Similarly, the case-group p_a drops from 60% in population 1 to 10% in population 2.
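
The odds ratios in the Simpson's paradox example can be checked with a few lines of Python. This is only a sketch: the function name odds_ratio_ci is made up for illustration, and the interval is assumed to be the usual Wald interval on the log odds ratio (the notes do not say which method was used).

```python
import math

def odds_ratio_ci(case_a, case_A, ctrl_a, ctrl_A, z=1.96):
    """Odds ratio of allele a (vs A) in cases relative to controls.

    Returns (OR, lower, upper), where the interval is a Wald interval
    computed on the log odds ratio (an assumption, see the lead-in).
    """
    odds_ratio = (case_a * ctrl_A) / (case_A * ctrl_a)
    se = math.sqrt(1 / case_a + 1 / case_A + 1 / ctrl_a + 1 / ctrl_A)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

# Allele counts from the example: (case a, case A, control a, control A)
pop1 = (60, 40, 9, 1)
pop2 = (1, 9, 30, 70)
combined = tuple(x + y for x, y in zip(pop1, pop2))

for name, counts in [("population 1", pop1),
                     ("population 2", pop2),
                     ("combined", combined)]:
    odds_ratio, lower, upper = odds_ratio_ci(*counts)
    print(f"{name}: OR = {odds_ratio:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Running it shows the reversal directly: the odds ratio is below 1 within each subpopulation but above 1 once the counts are pooled.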

Solutions to Heterogeneity

Heterogeneity in Phenotype and Affection Status Definition

The most famous examples are the psychiatric disorders. There is no other solution but to restrict/purify/stratify the phenotype definition.

Family-Based Association Analysis

How Much Stratification in a Population is Too Much for Association Analysis?

Measuring the amount of population substructure (stratification):
Wright's F-statistics, in particular Wright's Fst value.

                  allele a                allele A   genotype aA
subpopulation 1   p1                      q1         2 p1 q1
subpopulation 2   p2                      q2         2 p2 q2
combined          p, e.g. p=(p1+p2)/2     q          e.g. p1 q1 + p2 q2 (not equal to 2pq!)
E.g., two subpopulations of equal size, with p1=0.1 and p2=0.2, so p = (0.1+0.2)/2 = 0.15. The heterozygosity (frequency of the heterozygote aA) is actually (0.18+0.32)/2 = 0.25, but if calculated assuming a homogeneous population, 2pq = 2*0.15*0.85 = 0.255.

Subpopulation structure increases the frequency of homozygotes and decreases the frequency of heterozygotes, relative to a homogeneous population with the same allele frequencies (the Wahlund effect). Inbreeding has a similar effect.
Fst is defined as 1 minus the ratio of these two heterozygosities: here Fst = 1 - 0.25/0.255 = 0.01960784...

Another example: p1=0.1, p2=0.4, p = (0.1+0.4)/2= 0.25. The heterozygosity considering the subpopulation structure is (0.18+ 0.48)/2=0.33, whereas the heterozygosity without considering the subpopulation is 2*0.25*0.75=0.375.
Fst = 1- 0.33/0.375= 0.12.

If allele "a" remains the minor allele in both subpopulations, the maximum Fst is 0.33333...: p1=0, p2=0.5, p = (0+0.5)/2 = 0.25. The heterozygosity considering the subpopulation structure is (0+0.5)/2 = 0.25, whereas the heterozygosity without considering the subpopulation structure is 2*0.25*0.75 = 0.375. So Fst = 1 - 0.25/0.375 = 1 - 2/3 = 1/3.

If allele "a" is allowed to switch from being the minor allele in one subpopulation to being the major allele in the other, the maximum Fst is 1 (e.g. p1=0, p2=1, where every subpopulation heterozygosity is 0 while 2pq = 0.5).
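
The worked examples above all follow one recipe, which can be written as a short Python sketch. The function name fst_two_subpops and the weight parameter w1 are introduced here for illustration only; the notes use equal subpopulation sizes (w1 = 0.5), and the sketch covers a single biallelic locus with Hardy-Weinberg proportions within each subpopulation.

```python
def fst_two_subpops(p1, p2, w1=0.5):
    """Fst = 1 - H_S / H_T for one biallelic locus and two subpopulations.

    p1, p2: frequency of allele a in subpopulations 1 and 2.
    w1:     relative size of subpopulation 1 (0.5 means equal sizes).
    """
    w2 = 1.0 - w1
    p = w1 * p1 + w2 * p2                      # allele frequency in the combined population
    h_s = w1 * 2 * p1 * (1 - p1) + w2 * 2 * p2 * (1 - p2)  # actual heterozygosity
    h_t = 2 * p * (1 - p)                      # heterozygosity if the population were homogeneous
    if h_t == 0:
        raise ValueError("locus is monomorphic in the combined population")
    return 1.0 - h_s / h_t

# The four cases discussed in the text
for p1, p2 in [(0.1, 0.2), (0.1, 0.4), (0.0, 0.5), (0.0, 1.0)]:
    print(f"p1={p1}, p2={p2}: Fst = {fst_two_subpops(p1, p2):.4f}")
```

The four calls correspond to the values worked out by hand: about 0.0196, 0.12, 1/3, and 1.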

Examples:
three subpopulations (European-Americans, African-Americans, Asians of Japanese/Chinese ancestry): Fst = 0.145.
two subpopulations (Japanese and Chinese): Fst = 0.013.
33 Irish counties: Fst = 0.0132.
27 Finnish districts: Fst = 0.005.
11 Icelandic regions: Fst = 0.00338, 0.00048, 0.00017, 0.00137.

[J Marchini, LR Cardon, MS Phillips, P Donnelly (2004), "The effects of human population structure on large genetic association studies", Nature Genetics, 36:512-517.
A Helgason, B Yngvadóttir, B Hrafnkelsson, J Gulcher, K Stefánsson (2005), "An Icelandic example of the impact of population structure on association studies", Nature Genetics, 37:90-95.]

"Isolated" Population Has Lower Locus and Allelic Heterogeneity

Different ways to describe an isolated population:

The fundamental formula for linkage disequilibrium in an isolated population: suppose the probability of a recombination between the two positions is theta (the recombination fraction), with alleles a, A at position 1 and alleles b, B at position 2. Let p_ab(t) be the frequency of the a-b haplotype in generation t, and D_ab(t) = p_ab(t) - p_a p_b. Then after one generation of random mating (from generation t to generation t+1):
p_ab(t+1) = (1-theta) p_ab(t) + theta p_a p_b
p_ab(t+1) - p_a p_b = (1-theta) p_ab(t) + theta p_a p_b - p_a p_b
D_ab(t+1) = (1-theta) D_ab(t)
D_ab(t+1) = (1-theta)^2 D_ab(t-1)
D_ab(t+1) = (1-theta)^t D_ab(1)
D_ab(t+1) ~ (1-theta)^t
If linkage equilibrium has been reached between two positions (D=0), it persists. On the other hand, if there is initial linkage disequilibrium (D not equal to 0), it shrinks by the constant factor (1-theta) in each generation.
[COMMENT: note that only D appears in this formula. D', r^2, ... wouldn't have this nice property.]
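
As a quick numerical check of the decay formula, the following Python sketch iterates the one-generation haplotype recursion and compares it with the closed form. The function names and the starting values (p_a = 0.3, p_b = 0.4, initial a-b haplotype frequency 0.2, theta = 0.01) are hypothetical, chosen only for illustration.

```python
import math

def ld_after(d0, theta, generations):
    """Closed form: D after `generations` rounds of random mating."""
    return d0 * (1.0 - theta) ** generations

def iterate_haplotype_freq(p_ab0, p_a, p_b, theta, generations):
    """Apply p_ab(t+1) = (1-theta) p_ab(t) + theta p_a p_b repeatedly."""
    p_ab = p_ab0
    for _ in range(generations):
        p_ab = (1.0 - theta) * p_ab + theta * p_a * p_b
    return p_ab

def generations_to_halve(theta):
    """Number of generations for |D| to drop to half its value (0 < theta < 1)."""
    return math.log(0.5) / math.log(1.0 - theta)

# Hypothetical starting point, for illustration only
p_a, p_b, p_ab0, theta = 0.3, 0.4, 0.2, 0.01
d0 = p_ab0 - p_a * p_b                      # initial disequilibrium

for t in (1, 10, 100):
    d_iter = iterate_haplotype_freq(p_ab0, p_a, p_b, theta, t) - p_a * p_b
    print(f"t={t}: iterated D = {d_iter:.6f}, closed form = {ld_after(d0, theta, t):.6f}")

print(f"generations for D to halve at theta={theta}: {generations_to_halve(theta):.1f}")
```

The iterated and closed-form values agree up to floating-point error, which is just the statement that D shrinks by the same factor (1-theta) every generation.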

Gene-Environment Interaction Can Also Be Considered a Heterogeneity Issue

Why? Suppose a disease susceptibility gene has an effect only if an environmental factor is present (e.g. smoking):

                        smoking   not smoking
gene mutation present      +          -
gene mutation absent       -          -

We can carry out the case-control analysis in two stratified datasets: (1) smoking cases vs smoking controls; (2) non-smoking cases vs non-smoking controls. We expect to find an association signal in the first dataset, but not in the second.

The simplest situation (the "hydrogen atom") for gene-environment interaction is the 2-by-2-by-3 table:

                        aa    aA    AA
cases/smoking           .     .     .
controls/smoking        .     .     .

                        aa    aA    AA
cases/non-smoking       .     .     .
controls/non-smoking    .     .     .

The same thing can be constructed for alleles (a 2-by-2-by-2 table):

                        a     A
cases/smoking           .     .
controls/smoking        .     .

                        a     A
cases/non-smoking       .     .
controls/non-smoking    .     .

If the odds ratio from the first 2-by-2 table differs from the odds ratio from the second 2-by-2 table, there is an interaction between the marker and the environmental factor.
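
One simple way to judge whether two stratum-specific odds ratios differ is a z-test on the difference of their logarithms. The Python sketch below is only one possible approach, not necessarily the method intended in these notes; the function names are made up, and the allele counts are hypothetical numbers invented purely to make the example runnable.

```python
import math

def log_or_and_var(case_a, case_A, ctrl_a, ctrl_A):
    """Log odds ratio of a 2-by-2 allele table and its approximate variance."""
    log_or = math.log((case_a * ctrl_A) / (case_A * ctrl_a))
    var = 1 / case_a + 1 / case_A + 1 / ctrl_a + 1 / ctrl_A
    return log_or, var

def or_difference_z(stratum1, stratum2):
    """z statistic for the difference between the two strata's log odds ratios."""
    log_or1, var1 = log_or_and_var(*stratum1)
    log_or2, var2 = log_or_and_var(*stratum2)
    return (log_or1 - log_or2) / math.sqrt(var1 + var2)

def two_sided_p(z):
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))

# HYPOTHETICAL allele counts: (case a, case A, control a, control A)
smoking = (120, 80, 90, 110)
non_smoking = (100, 100, 100, 100)

z = or_difference_z(smoking, non_smoking)
print(f"z = {z:.3f}, two-sided p = {two_sided_p(z):.4f}")
```

A large |z| (small p-value) suggests that the marker's effect depends on smoking status. With small cell counts the normal approximation is poor, so the result should be read with caution.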

See: LD Botto, MJ Khoury (2001), "Facing the challenge of gene-environment interaction: the two-by-four table and beyond", American Journal of Epidemiology, 153:1016-1020.