| LECTURE 7: VARIOUS STATISTICAL ISSUES IN ASSOCIATION ANALYSIS |
Review
The decision table (the probability of... given the truth...):
| fail to reject null | reject null/accept alternative | |
| null is true | type-I error, false-positive, alpha | |
| alternative is true | type-II error, false-negative, beta ("miss") | power, 1-beta |
| claim a marker not associated with the disease | claim a marker is associated with the disease | |
| there is no association | type-I error, false-positive, alpha | |
| there is an association | type-II error, false-negative, beta ("miss") | power, 1-beta |
The figure in this page: http://davidmlane.com/hyperstat/power.html summarizes the main properties of power.
See: JM Satagopan, RC Elston (2003), "Optimal two-stage genotyping in population-based association studies", Genetic Epidemiology, 25:149-157. PDF
N: sample size per group (# of case alleles = # control alleles, and total
number of alleles is 2N)
The alternative hypothesis can be given as a disease
model: penetrance
fdd,
fdD,
fDD.
Here is a
R-script for this calculation.
Even simpler, R has a subroutine
power.prop.test() (see a
manual page:
http://www.maths.lth.se/help/R/.R/library/ctest/html/power.prop.test.html)
Or, the alternative hypothesis can be given
as the ratio of two penetrances called relative
risk: lambdaDD=fDD/fdd,
lambdaDd=fDd/fdd
Or, if the disease model is multiplicative, the two
relative risks become one
Many papers. Here is a recent one:
R McGinnis, S Shifman, A Darvasi (2002),
"Power and efficiency of the TDT and case-control design for
association scans", Behavior Genetics, 32:135-144.
PDF
Main conclusion: with the same disease model, the number of
cases (or controls) in case-control analysis is similar
to the number of trios in TDT test. So overall, TDT requires
more samples than case-control (the number of persons
to be typed is 3:2).
Some computer programs for TDT power calculation:
Suppose we are testing association of a marker with the
disease: it is one test. Then we are testing 1,000,000
markers with the disease (1 million tests).
Even if none of the marker is truly associated with
the disease, since under the null, p-values are
uniformly distributed, on average, 50000 markers
should have p-value between (0, 0.05), 10000
markers between (0, 0.01), and 1000 markers have
p-values < 0.001.
With a fixed p-value threshold, the more markers,
the more likely that we will get "significant"
test results by chance alone.
There are at least three solutions:
1. Bonferroni correction If n
tests are independent, multiply each p-value by n
(if the resulting value is larger than 1, set it to 1).
Alternatively, we say the single-test significance level
alpha is equivalent to the multiple-test significance
level alpha/n (more stringent).
2. Plotting the histogram of p-values This maybe
the best solution: if the null model is correct, and
the only problem is that the number of markers (thus
the number of tests) is huge, then plotting the histogram
would reveal whether the null is correct (if the histogram
is flat) or not (if there is a peak towards the p-value=0
border).
3. False discovery rate, q-value
See the manual page on
QVALUE program:
http://faculty.washington.edu/~jstorey/qvalue/manual.pdf
Switching the condition and consequence in the conditional
probability:
This topic is relevant to genome-wide association
because thousands, tens of thousands of SNP will
be used for association analysis. Note, however,
that neighboring SNPs are not independent, while
the multiple test assumes each test is an independent
one.
Although the concept of likelihood is used more often
in linkage analysis than in association analysis, it
is useful to understand the basics, since there are
association methods that are based in this framework.
Likelihood function is a statistical model of the
data, disregard the constant coefficient (normalization
factor): L ~ f(data, parameters).
For example, to model the data of tossing coin
n times, with nH heads,
and nT tails, the likelihood function
is L ~ pnH (1-p)nT,
where p is a parameter in the model. The meaning
of p is the probability to have a head.
The parameter in the likelihood function can
be estimated from the data. Usually the estimation
is such that at the estimated parameter values,
the likelihood function reaches its maximum value.
The estimated value is then the
maximum likelihood estimation of the parameter,
and the likelihood function evaluated at these
estimated parameter values is the maximum
likelihood. For example, the maximum likelihood
estimation of p is nH/n. and the maximum
likelihood is (nH/n)nH
(nT/n)nT.
Taking the logarithm of the maximum likelihood:
log(maxL) =
nH log nH/n +
nT log nT/n
= n
(nH/n log nH/n +
nT/n log nT/n ).
The quantity within the () is called entropy.
Two different statistical models are used to describe the
data: one uses one parameter, another uses two parameters.
For example, of 100 case samples and
100 controls samples, we may say that the case and
control group has the same minor allele frequency (p),
we may also say that case group and control has two
different minor allele frequencies ( p1 and
p2) )
Likelihoods under both models are calculated and
maximized by fitting the parameter values: maxL0
(one parameter model, or "null"),
max La (two parameter models, or "alternative").
The main formula: under the null hypothesis:
In general, if there are N free parameters in the alternative
model (the more complicated model) and M free parameters in the
null model (the simpler model), then the degree of freedom
in the chi-square distribution is N-M.
Who proposed the likelihood ratio test first?
SS Wilks (1938), "The large-sample distribution of the likelihood
ratio for testing composite hypotheses",
Ann Ann. Math. Stat., 9:60-62.
Both EH
(
http://linkage.rockefeller.edu/ott/eh.htm)
and EH-PLUS programs
http://www.hgmp.mrc.ac.uk/~jzhao/software.htm
use this approach, though for EH, one has to take the
max-likelihood values from the program, then determine
the p-value from another statistical program (e.g. R).
It models the probability for binary outcomes as
Because it is easy to add more parameters to model more
factors, logistic regression model is often used for
epidemiology studies. It can also be used to detect
interactions.
Who used logistic regression first?
"Bartlett (1937) used log (y/(1-y) in
regression and ANOVA to transform observations
y that are continuous proportions. In a book of
statistical tables published in 1938, R.A. Fisher
and Frank Yates suggested it as a possible transformation
of a binomial parameter for analyzing binary data.
In 1944, the physician and statistician Joseph Berkson
introduced the term logit for this transformation.
Berkson showed that the model using the logit fitted
similarly to the probit model, and his subsequent work did
much to popularize the logistic regression. "
Typically, in a case-control analysis, each sample (person)
is independent. Relatives, since they are genetically correlated,
are not independent samples.
However, case-control analysis is usually carried out
after a linkage analysis is done (for the purpose of
confirming a linkage signal). And linkage analysis was
carried out in pedigrees with multiple affecteds.
Pedigrees with two affected sibs are particular popular
due to the use the "affected-sib-linkage-analysis".
We propose a procedure which can take into account of
the correlation between the two sibs:
W Li, Y Yang, PG Gregersen (2005),
"Effective sample size method in incorporating
the second affected sib in case-control analyses",
manuscript.
z1-alpha: the position/threshold for
a standard normal distribution (mean=0, variance=1)
that that the tail area from here to infinite is equal
to alpha.
[
In R, the subroutine is, e.g., qnorm(1-0.025),
if the 5% is split into two 2.5% (two-sided test)
]
alpha: type I error
z1-alpha: same thing
beta: type II error (power= 1-beta)
p
mu: standardized difference between the two allele
frequencies in the alternative hypothesis:
mu = (p
Then p1 (case group)= Prob(D|affected)= P(affected|D) P(D)/P(affected)
=p(p fDD + q fdD)/(fDD p2
+ fdD 2pq + fdd q2)
p2 (control group)= Prob(D|unaffected)=
P(unaffected|D) P(D)/P(unaffected)
=p(p (1-fDD) + q (1-fdD))/(1-(fDD p2
+ fdD 2pq + fdd q2))
http://www.biostat.jhsph.edu/~wmchen/pc.html
http://statgen.iop.kcl.ac.uk/gpc/dtdt.html
http://www.biostat.harvard.edu/~clange/default.htm
http://odin.mdacc.tmc.edu/biomath/anonftp/page_3.html
See, however, e.g.,
Perneger (1998), "What's wrong with Bonferroni adjustments",
BMJ, 316:1236-1238.
http://bmj.bmjjournals.com/cgi/content/full/316/7139/1236
q-value = Prob( null hypothesis is true | test statistic
is equal or larger than observed)