LECTURE 5: PRACTICAL ISSUES (II): HETEROGENEITY
The Concept of Population
In population genetics, a population is usually defined by
random mating. Random mating leads to homogeneity, and homogeneity
makes it possible to define genotype and allele frequencies.
In Sewall Wright's four-volume book, Evolution and the
Genetics of Populations, Vol 2 is called "The Theory of
Gene Frequencies" (1969).
Different Types of Heterogeneity in Association Analysis
Inhomogeneity (heterogeneity) is one of the two major difficulties
in association analysis.
Various types of heterogeneity in the case group:
-
Locus heterogeneity: genes at more than one chromosomal location
are related to the disease.
This is a problem for both linkage analysis and association
analysis.
-
Allelic heterogeneity: even if only one gene is related to
the disease, different mutations at different locations
within the gene can each damage it.
This is not a problem for linkage analysis, but it is a problem for
association analysis.
See:
JD Terwilliger, KM Weiss (1998), "Linkage disequilibrium mapping and
complex disease: fantasy or reality?", Current Opinion in Biotechnology,
9(6):578-594.
JK Pritchard, NJ Cox (2002), "The allelic architecture of human disease
genes: common disease-common variant... or not?", Human Molecular Genetics,
11:2417-2423.
Various types of heterogeneity in the control group:
-
Different ethnic groups, sub-populations, migrations
-
Causes of heterogeneity: genetic drift (allele frequencies change by chance
in an isolated population), mutation (a historical event that stays frozen
in the individuals who carry it), selection (a certain allele comes to
make up a larger proportion of the population for some reason)
What Happens to the Association Analysis When There Is Heterogeneity?
Simpson's Paradox
See:
http://en.wikipedia.org/wiki/Simpson's_paradox
Consider the following two populations:
|         | a  | A  |
| case    | 60 | 40 |
| control | 9  | 1  |
OR= 0.167, 95%CI = (0.02, 1.367)
|         | a  | A  |
| case    | 1  | 9  |
| control | 30 | 70 |
OR= 0.259, 95%CI = (0.03, 2.138)
In both subpopulations, the OR is smaller than 1, so allele "A" is
preferred in the case group. However, when we combine the two
subpopulations to form one population:
|         | a  | A  |
| case    | 61 | 49 |
| control | 39 | 71 |
OR = 2.266, 95% CI = (1.317, 3.898)
which prefers allele "a" in the case group.
In general, if severe heterogeneity is not removed, an
association analysis result cannot be trusted.
The heterogeneity in this example is very easy to detect:
the control group allele frequency p_a drops from
90% in population-1 to 30% in population-2.
Similarly, the case group p_a drops
from 60% in population-1 to 10% in population-2.
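The three odds ratios can be recomputed with a few lines of Python. This is a minimal sketch, assuming the confidence intervals are Wald intervals on the log odds ratio (an assumption, though it reproduces the interval quoted for population 2); the function name odds_ratio_wald is made up for this illustration.

```python
import math

def odds_ratio_wald(case_a, case_A, ctrl_a, ctrl_A, z=1.96):
    """Odds ratio of allele a (cases vs controls) with a Wald 95% CI."""
    odds_ratio = (case_a * ctrl_A) / (case_A * ctrl_a)
    # Standard error of log(OR): square root of the summed reciprocal counts.
    se = math.sqrt(1 / case_a + 1 / case_A + 1 / ctrl_a + 1 / ctrl_A)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

# Counts in the order (case a, case A, control a, control A), from the tables.
tables = {
    "population 1": (60, 40, 9, 1),
    "population 2": (1, 9, 30, 70),
}
# The combined table is the cell-by-cell sum of the two subpopulations.
pooled = tuple(sum(cells) for cells in zip(*tables.values()))
tables["combined"] = pooled

for name, cells in tables.items():
    odds_ratio, lower, upper = odds_ratio_wald(*cells)
    print(f"{name}: OR = {odds_ratio:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

The direction of the odds ratio flips only in the pooled table, because the two subpopulations differ both in allele frequency and in their case/control proportions.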
Solutions to Heterogeneity
-
Do not combine samples from different ethnic groups
(in particular, Africans tend to have different allele
frequencies from Caucasians).
-
If samples of different ethnic backgrounds are included, there
are programs that identify them as being different from the rest
of the samples:
e.g. CHECKHET program:
http://www.mds.qmw.ac.uk/statgen/dcurtis/software.html
-
While individuals from different ethnic groups can be separated
("stratified samples"), ancestral components from different ethnic
groups within one person cannot be separated ("admixed samples").
-
Keep the heterogeneity as it is, but "correct" for it, e.g. by using
unlinked "neutral" markers to measure how much allele
frequencies differ between the case and control groups
for reasons unrelated to the disease.
JK Pritchard, NA Rosenberg (1999),
"Use of unlinked genetic markers to detect population stratification
in association studies", American Journal of Human Genetics,
65:220-228.
B Devlin, K Roeder (1999),
"Genomic control for association studies",
Biometrics, 55:997-1004.
JK Pritchard, M Stephens, NA Rosenberg, P Donnelly (2000),
"Association mapping in structured populations",
American Journal of Human Genetics, 67:170-181.
-
Use the untransmitted parental allele as a "pseudo control" allele, and
the allele actually transmitted to an affected offspring as
the "case allele". This is the so-called "family-based association".
-
Pick samples from an isolated population so that both
the locus heterogeneity and the allelic heterogeneity
in the samples are reduced.
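The "correct it" option in the list above can be illustrated with genomic control (Devlin and Roeder 1999, cited above). The sketch below assumes 1-df chi-square association statistics have already been computed at a set of unlinked null markers. The inflation factor lambda is their median divided by the median of a 1-df chi-square distribution (about 0.455), and the candidate marker's statistic is divided by lambda. The chi-square values are hypothetical, and the clamp of lambda at 1 is a common convention rather than something stated in these notes.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_control(null_chi2, candidate_chi2):
    """Return (lambda, adjusted statistic) for a candidate marker.

    null_chi2: 1-df chi-square statistics at unlinked "neutral" markers.
    candidate_chi2: 1-df chi-square statistic at the marker of interest.
    """
    inflation = statistics.median(null_chi2) / CHI2_1DF_MEDIAN
    inflation = max(1.0, inflation)  # never inflate the candidate statistic
    return inflation, candidate_chi2 / inflation

# HYPOTHETICAL statistics, for illustration only.
null_chi2 = [0.1, 0.5, 0.9, 1.3, 2.0]
inflation, adjusted = genomic_control(null_chi2, candidate_chi2=10.0)
print(f"lambda = {inflation:.3f}, adjusted chi-square = {adjusted:.3f}")
```

In practice many more null markers would be needed for the median to be a stable estimate.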
Heterogeneity in Phenotype and Affection Status Definition
The most famous examples are the psychiatric disorders. There
is no other solution but to restrict/purify/stratify the phenotype
definition.
Family-Based Association Analysis
-
Must have an affected offspring and his/her parents
-
If we know one of the parents is not the source of the mutation, one parent is enough.
-
Since we usually do not know which parent contributes the disease susceptibility
mutation, both parents are usually used: a collection of trios
-
The advantage of family-based association: both the transmitted
"case allele" and the untransmitted "pseudo control allele" come from
the same parent, so they share the same source and there is no
population stratification.
[COMMENT: what if the parent him/herself is admixed?]
-
The disadvantages: parents' DNA is hard to obtain, and three persons
need genotyping as compared to two persons in a case-control analysis
(more costly).
-
New mathematics: since the transmitted and untransmitted alleles
always come in a pair, this is a matched case-control analysis.
-
For matched case-control analysis, if the case allele is
the same as the control allele, this pair does not
contribute. As a result, only when the parent of the
affected offspring is heterozygous, can this pair be
included in the sample.
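The matched-pair counting described above can be sketched as follows. Only heterozygous parents are counted, and the two transmission counts are compared with a McNemar-type statistic (b - c)^2 / (b + c). This statistic is commonly known as the transmission disequilibrium test (TDT), a name these notes do not use. The input format and the data are hypothetical.

```python
def transmission_counts(transmissions, allele="a"):
    """Count transmissions from heterozygous parents and compute
    the McNemar-type statistic (b - c)^2 / (b + c).

    transmissions: list of (parent_genotype, transmitted_allele),
    one entry per parent of an affected offspring, e.g. ("aA", "a").
    """
    b = 0  # heterozygous parents who transmitted `allele`
    c = 0  # heterozygous parents who transmitted the other allele
    for genotype, transmitted in transmissions:
        if genotype[0] == genotype[1]:
            continue  # homozygous parent: case allele equals control allele
        if transmitted == allele:
            b += 1
        else:
            c += 1
    statistic = (b - c) ** 2 / (b + c) if (b + c) > 0 else float("nan")
    return b, c, statistic

# HYPOTHETICAL parent-to-offspring transmissions, for illustration only.
transmissions = [
    ("aA", "a"), ("aA", "a"), ("aa", "a"), ("Aa", "A"),
    ("aA", "a"), ("AA", "A"), ("aA", "a"),
]
b, c, statistic = transmission_counts(transmissions)
print(f"b = {b}, c = {c}, statistic = {statistic:.2f}")
```

Under no association the statistic is approximately chi-square distributed with 1 degree of freedom, provided b + c is not too small.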
How Much Stratification in a Population Is Too Much for Association Analysis?
Measuring the level/amount of subpopulations/stratifications:
Wright's F-statistics or Wright's Fst value.
|                 | allele-a             | allele-A | genotype aA                        |
| subpopulation 1 | p1                   | q1       | 2p1 q1                             |
| subpopulation 2 | p2                   | q2       | 2p2 q2                             |
| combined        | p (e.g. p=(p1+p2)/2) | q        | e.g. p1q1+p2q2 (not equal to 2pq!) |
E.g., two subpopulations of equal size:
p1=0.1,
p2=0.2,
p = (0.1+0.2)/2=0.15.
The heterozygosity (frequency of the heterozygous
aA) is actually (0.18+0.32)/2=0.25, but
if calculated assuming a homogeneous population,
2pq= 2*0.15*0.85=0.255.
Subpopulation structure increases the frequency of
homozygotes and decreases the frequency of
heterozygotes. (Inbreeding has a similar effect.)
Fst is defined as one minus the ratio of the actual
heterozygosity to the heterozygosity expected in a
homogeneous population:
Fst = 1- 0.25/0.255 = 0.01960784...
Another example:
p1=0.1,
p2=0.4,
p = (0.1+0.4)/2= 0.25.
The heterozygosity considering the
subpopulation structure is (0.18+ 0.48)/2=0.33,
whereas the heterozygosity without considering
the subpopulation is 2*0.25*0.75=0.375.
Fst = 1- 0.33/0.375= 0.12.
If allele "a" remains the minor allele in both
subpopulations, the maximum Fst is 0.33333...:
p1=0,
p2=0.5,
p = (0+0.5)/2= 0.25.
The heterozygosity considering the
subpopulation structure is (0+ 0.5)/2=0.25,
whereas the heterozygosity without considering
the subpopulation is 2*0.25*0.75=0.375.
And Fst = 1- 0.25/0.375= 1-2/3=1/3.
If the allele "a" is allowed to switch from
a minor allele in one subpopulation to a
major allele in another subpopulation,
maximum Fst can be 1.
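The worked examples above can be reproduced with a small function. This is a minimal sketch for the special case used in these notes: two subpopulations of equal size, one biallelic marker, and Hardy-Weinberg proportions within each subpopulation. The function name is made up for this illustration.

```python
def fst_two_equal_subpops(p1, p2):
    """Fst for two equal-sized subpopulations with allele-a frequencies p1, p2."""
    p = (p1 + p2) / 2
    # Actual heterozygosity: average of the within-subpopulation values 2pq.
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    # Heterozygosity expected if the combined population were homogeneous.
    h_total = 2 * p * (1 - p)
    if h_total == 0:
        raise ValueError("allele is fixed or absent in the combined population")
    return 1 - h_within / h_total

for p1, p2 in [(0.1, 0.2), (0.1, 0.4), (0.0, 0.5), (0.0, 1.0)]:
    print(f"p1={p1}, p2={p2}: Fst = {fst_two_equal_subpops(p1, p2):.5f}")
```

The last pair (p1=0, p2=1) is the case where allele "a" switches from minor to major allele, which gives the maximum value of 1.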
Examples:
three subpopulations: European-Americans,
African-Americans, Asians of Japanese/Chinese
ancestry: Fst =0.145.
two subpopulations: Japanese and Chinese
Fst =0.013.
33 Irish counties :
Fst =0.0132.
27 Finnish districts :
Fst =0.005.
11 Icelandic regions :
Fst =0.00338, 0.00048, 0.00017, 0.00137.
[J Marchini, LR Cardon, MS Phillips, P Donnelly (2004),
"The effects of human population structure on
large genetic association studies",
Nature Genetics, 36:512-517.
A Helgason, B Yngvadóttir, B Hrafnkelsson, J Gulcher, K Stefánsson (2005),
"An Icelandic example of the impact of population structure on
association studies", Nature Genetics, 37:90-95.
]
"Isolated" Population Has Lower Locus and Allelic Heterogeneity
Different ways to describe an isolated population:
-
Not many migrations that may introduce different
allele frequencies
-
Fewer founders
-
Having fewer founders can result from the population
being formed by a "bottleneck" followed by an expansion.
The bottleneck reduces the number of founders. The expansion
means that all samples collected can be
traced back to those few founders (so expansion by natural
growth, not expansion by immigration).
The fundamental formula for linkage disequilibrium
in an isolated population:
Suppose the probability that there is a recombination
between the two positions is theta ( recombination fraction).
Allele a, A at position 1, allele b, B at position 2.
Then after one generation (from generation t to generation t+1):
p_ab(t+1) = (1-theta) p_ab(t) + theta p_a p_b
p_ab(t+1) - p_a p_b = (1-theta) p_ab(t) + theta p_a p_b - p_a p_b
                    = (1-theta) [p_ab(t) - p_a p_b]
With D_ab(t) = p_ab(t) - p_a p_b:
D_ab(t+1) = (1-theta) D_ab(t)
          = (1-theta)^2 D_ab(t-1)
          = ...
          = (1-theta)^t D_ab(1)
D_ab(t+1) ~ (1-theta)^t
If linkage equilibrium is reached between two positions
(D=0), it will remain so. On the other hand, if there is
a starting linkage disequilibrium D>0, it will be reduced
by the constant factor (1-theta) in each generation.
[COMMENT: note that only D appears in this formula.
D', r^2, ... wouldn't have this nice property.
]
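The decay of D can be checked numerically by iterating the one-generation recursion for the haplotype frequency and comparing it with the closed form. This is a minimal sketch: the starting frequencies and theta are hypothetical, and generations are counted from 0, so D after n generations is (1-theta)^n times the starting D.

```python
def ld_closed_form(d_start, theta, generations):
    """D after the given number of generations, from the closed form."""
    return (1 - theta) ** generations * d_start

def iterate_haplotype_frequency(p_ab, p_a, p_b, theta, generations):
    """Apply p_ab(t+1) = (1-theta) p_ab(t) + theta p_a p_b repeatedly."""
    for _ in range(generations):
        p_ab = (1 - theta) * p_ab + theta * p_a * p_b
    return p_ab

# HYPOTHETICAL starting values, for illustration only.
p_a, p_b, p_ab_start = 0.3, 0.4, 0.2
theta = 0.01
d_start = p_ab_start - p_a * p_b  # starting linkage disequilibrium

for n in (0, 10, 50, 100):
    p_ab_n = iterate_haplotype_frequency(p_ab_start, p_a, p_b, theta, n)
    d_iterated = p_ab_n - p_a * p_b
    d_formula = ld_closed_form(d_start, theta, n)
    print(f"generation {n}: D = {d_iterated:.5f} (closed form {d_formula:.5f})")
```

The allele frequencies p_a and p_b are held fixed here, as in the derivation, which ignores drift, selection and mutation.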
Gene-Environment Interaction Can Also Be Considered a Heterogeneity Issue
Why? Suppose a disease susceptibility gene has an
effect only if an environmental factor is present
(e.g. smoking):
|                       | smoking | not smoking |
| gene mutation present | +       | -           |
| gene mutation absent  | -       | -           |
We can carry out case-control analysis in two stratified
datasets: (1) smoking cases vs smoking controls;
(2) non-smoking cases vs non-smoking controls. We
expect to find an association signal in the first
dataset, but not in the second dataset.
The simplest situation (the "hydrogen atom")
of gene-environment interaction is the 2-by-2-by-3 table:
|                  | aa | aA | AA |
| cases/smoking    | .  | .  | .  |
| controls/smoking | .  | .  | .  |

|                      | aa | aA | AA |
| cases/non-smoking    | .  | .  | .  |
| controls/non-smoking | .  | .  | .  |
The same can be constructed for alleles (a 2-by-2-by-2 table):
|                  | a | A |
| cases/smoking    | . | . |
| controls/smoking | . | . |

|                      | a | A |
| cases/non-smoking    | . | . |
| controls/non-smoking | . | . |
If the odds ratio from the first 2-by-2 table is
different from the odds ratio from the second 2-by-2
table, there is an interaction between
the marker and the environmental factor.
See: LD Botto, MJ Khoury (2001),
"Facing the challenge of gene-environment
interaction: the two-by-four table and
beyond", American Journal of Epidemiology,
153:1016-1020.
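The comparison of the two stratum-specific odds ratios in the 2-by-2-by-2 table can be sketched as follows. The allele counts are hypothetical. The z statistic is one simple way to compare two independent log odds ratios (a Wald-type test), chosen here as an assumption rather than taken from these notes.

```python
import math

def log_odds_ratio(case_a, case_A, ctrl_a, ctrl_A):
    """Log odds ratio of allele a (cases vs controls) and its standard error."""
    log_or = math.log((case_a * ctrl_A) / (case_A * ctrl_a))
    se = math.sqrt(1 / case_a + 1 / case_A + 1 / ctrl_a + 1 / ctrl_A)
    return log_or, se

# HYPOTHETICAL allele counts: (case a, case A, control a, control A).
smoking = (60, 40, 40, 60)
non_smoking = (45, 55, 45, 55)

log_or_s, se_s = log_odds_ratio(*smoking)
log_or_n, se_n = log_odds_ratio(*non_smoking)

# A ratio of odds ratios different from 1 suggests interaction.
ratio_of_odds_ratios = math.exp(log_or_s - log_or_n)
# Wald-type z statistic for the difference of the two log odds ratios.
z = (log_or_s - log_or_n) / math.sqrt(se_s ** 2 + se_n ** 2)

print(f"OR (smoking) = {math.exp(log_or_s):.2f}")
print(f"OR (non-smoking) = {math.exp(log_or_n):.2f}")
print(f"ratio of ORs = {ratio_of_odds_ratios:.2f}, z = {z:.2f}")
```

With these made-up counts the association is present only among smokers, matching the +/- pattern in the first table of this section.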