LECTURE 7: VARIOUS STATISTICAL ISSUES IN ASSOCIATION ANALYSIS

Statistical Power

Review

These four quantities inter-related to each other: power (gamma), type-I error (alpha), alternative hypothesis, and sample size.

The decision table (the probability of... given the truth...):

fail to reject null reject null/accept alternative
null is true type-I error, false-positive, alpha
alternative is true type-II error, false-negative, beta ("miss") power, 1-beta
For example, for association analysis:
claim a marker not associated with the disease claim a marker is associated with the disease
there is no association type-I error, false-positive, alpha
there is an association type-II error, false-negative, beta ("miss") power, 1-beta

The figure in this page: http://davidmlane.com/hyperstat/power.html summarizes the main properties of power.

  1. Increasing the "effect size" will increase the power (meaning the more different between the alternative and the null hypothesis, the higher the power).
  2. The more relaxed the significance level, the higher the power (more false positive errors, less false negative errors. and vice versa. Two errors are compensating each other.)
  3. Sample size will increase the power
Power for Case-Control Analysis

See: JM Satagopan, RC Elston (2003), "Optimal two-stage genotyping in population-based association studies", Genetic Epidemiology, 25:149-157. PDF

N: sample size per group (# of case alleles = # control alleles, and total number of alleles is 2N)
z1-alpha: the position/threshold for a standard normal distribution (mean=0, variance=1) that that the tail area from here to infinite is equal to alpha. [ In R, the subroutine is, e.g., qnorm(1-0.025), if the 5% is split into two 2.5% (two-sided test) ]
alpha: type I error
z1-alpha: same thing
beta: type II error (power= 1-beta)
p1, p2: the allele frequencies in two groups in the alternative hypothesis (given).
mu: standardized difference between the two allele frequencies in the alternative hypothesis: mu = (p1 - p2)/sqrt[ p1 (1- p1) + p2 (1- p2) ]

N= (z1-alpha + z1-beta)2/ mu2
Check: indeed, this formula contains all four quantities: sample size, type-I error, power, effect size (alternative hypothesis).

The alternative hypothesis can be given as a disease model: penetrance fdd, fdD, fDD.
Then p1 (case group)= Prob(D|affected)= P(affected|D) P(D)/P(affected) =p(p fDD + q fdD)/(fDD p2 + fdD 2pq + fdd q2)
p2 (control group)= Prob(D|unaffected)= P(unaffected|D) P(D)/P(unaffected) =p(p (1-fDD) + q (1-fdD))/(1-(fDD p2 + fdD 2pq + fdd q2))

Here is a R-script for this calculation.

Even simpler, R has a subroutine power.prop.test() (see a manual page: http://www.maths.lth.se/help/R/.R/library/ctest/html/power.prop.test.html)

Or, the alternative hypothesis can be given as the ratio of two penetrances called relative risk: lambdaDD=fDD/fdd, lambdaDd=fDd/fdd

Or, if the disease model is multiplicative, the two relative risks become one

Power for TDT

Many papers. Here is a recent one: R McGinnis, S Shifman, A Darvasi (2002), "Power and efficiency of the TDT and case-control design for association scans", Behavior Genetics, 32:135-144. PDF

Main conclusion: with the same disease model, the number of cases (or controls) in case-control analysis is similar to the number of trios in TDT test. So overall, TDT requires more samples than case-control (the number of persons to be typed is 3:2).

Some computer programs for TDT power calculation:
http://www.biostat.jhsph.edu/~wmchen/pc.html
http://statgen.iop.kcl.ac.uk/gpc/dtdt.html
http://www.biostat.harvard.edu/~clange/default.htm
http://odin.mdacc.tmc.edu/biomath/anonftp/page_3.html

Multiple Testing

Suppose we are testing association of a marker with the disease: it is one test. Then we are testing 1,000,000 markers with the disease (1 million tests).

Even if none of the marker is truly associated with the disease, since under the null, p-values are uniformly distributed, on average, 50000 markers should have p-value between (0, 0.05), 10000 markers between (0, 0.01), and 1000 markers have p-values < 0.001.

With a fixed p-value threshold, the more markers, the more likely that we will get "significant" test results by chance alone.

There are at least three solutions:

1. Bonferroni correction If n tests are independent, multiply each p-value by n (if the resulting value is larger than 1, set it to 1). Alternatively, we say the single-test significance level alpha is equivalent to the multiple-test significance level alpha/n (more stringent).
See, however, e.g., Perneger (1998), "What's wrong with Bonferroni adjustments", BMJ, 316:1236-1238. http://bmj.bmjjournals.com/cgi/content/full/316/7139/1236

2. Plotting the histogram of p-values This maybe the best solution: if the null model is correct, and the only problem is that the number of markers (thus the number of tests) is huge, then plotting the histogram would reveal whether the null is correct (if the histogram is flat) or not (if there is a peak towards the p-value=0 border).

3. False discovery rate, q-value See the manual page on QVALUE program: http://faculty.washington.edu/~jstorey/qvalue/manual.pdf

Switching the condition and consequence in the conditional probability:

p-value = Prob( test statistic is equal or larger than observed | null hypothesis is true)
q-value = Prob( null hypothesis is true | test statistic is equal or larger than observed)

This topic is relevant to genome-wide association because thousands, tens of thousands of SNP will be used for association analysis. Note, however, that neighboring SNPs are not independent, while the multiple test assumes each test is an independent one.

Likelihood Function, Maximum Likelihood

Although the concept of likelihood is used more often in linkage analysis than in association analysis, it is useful to understand the basics, since there are association methods that are based in this framework.

Likelihood function is a statistical model of the data, disregard the constant coefficient (normalization factor): L ~ f(data, parameters). For example, to model the data of tossing coin n times, with nH heads, and nT tails, the likelihood function is L ~ pnH (1-p)nT, where p is a parameter in the model. The meaning of p is the probability to have a head.

The parameter in the likelihood function can be estimated from the data. Usually the estimation is such that at the estimated parameter values, the likelihood function reaches its maximum value. The estimated value is then the maximum likelihood estimation of the parameter, and the likelihood function evaluated at these estimated parameter values is the maximum likelihood. For example, the maximum likelihood estimation of p is nH/n. and the maximum likelihood is (nH/n)nH (nT/n)nT.

Taking the logarithm of the maximum likelihood: log(maxL) = nH log nH/n + nT log nT/n = n (nH/n log nH/n + nT/n log nT/n ). The quantity within the () is called entropy.

Maximum Likelihood Ratio Test

Two different statistical models are used to describe the data: one uses one parameter, another uses two parameters. For example, of 100 case samples and 100 controls samples, we may say that the case and control group has the same minor allele frequency (p), we may also say that case group and control has two different minor allele frequencies ( p1 and p2) )

Likelihoods under both models are calculated and maximized by fitting the parameter values: maxL0 (one parameter model, or "null"), max La (two parameter models, or "alternative").

The main formula: under the null hypothesis:

2 log ( maxLa/max L0) ~ chi-square distribution (df=1)
With the distribution known, we can do a statistical test.

In general, if there are N free parameters in the alternative model (the more complicated model) and M free parameters in the null model (the simpler model), then the degree of freedom in the chi-square distribution is N-M.

Who proposed the likelihood ratio test first? SS Wilks (1938), "The large-sample distribution of the likelihood ratio for testing composite hypotheses", Ann Ann. Math. Stat., 9:60-62.

Both EH ( http://linkage.rockefeller.edu/ott/eh.htm) and EH-PLUS programs http://www.hgmp.mrc.ac.uk/~jzhao/software.htm use this approach, though for EH, one has to take the max-likelihood values from the program, then determine the p-value from another statistical program (e.g. R).

Logistic Regression is a Particular Statistical Model

It models the probability for binary outcomes as

p(case) = 1/(1+ ez ), z can be anything, e.g. z= -a -b*age -c*sex -d*drinking
a, b, c... are the parameters in the model, and the likelihood is L ~ pNcase (1-p)Ncontrol

Because it is easy to add more parameters to model more factors, logistic regression model is often used for epidemiology studies. It can also be used to detect interactions.

Who used logistic regression first? "Bartlett (1937) used log (y/(1-y) in regression and ANOVA to transform observations y that are continuous proportions. In a book of statistical tables published in 1938, R.A. Fisher and Frank Yates suggested it as a possible transformation of a binomial parameter for analyzing binary data. In 1944, the physician and statistician Joseph Berkson introduced the term logit for this transformation. Berkson showed that the model using the logit fitted similarly to the probit model, and his subsequent work did much to popularize the logistic regression. "

Using Both Affected Sibs in Case-Control Analysis

Typically, in a case-control analysis, each sample (person) is independent. Relatives, since they are genetically correlated, are not independent samples.

However, case-control analysis is usually carried out after a linkage analysis is done (for the purpose of confirming a linkage signal). And linkage analysis was carried out in pedigrees with multiple affecteds. Pedigrees with two affected sibs are particular popular due to the use the "affected-sib-linkage-analysis".

We propose a procedure which can take into account of the correlation between the two sibs: W Li, Y Yang, PG Gregersen (2005), "Effective sample size method in incorporating the second affected sib in case-control analyses", manuscript.