LECTURE 3: INTRODUCTION TO GENETIC ASSOCIATION ANALYSIS

Online Material on Association Analysis
  1. http://www.meb.ki.se/genestat/tl/genass_ldmap/index.htm (from Institutionen for Medicinsk Epidemiologi och Biostatistik, SWEDEN).
  2. David Clayton's lecture slides: http://www-gene.cimr.cam.ac.uk/clayton/talks/ (advanced topics)
  3. PowerPoint slides by D. Altschuler (also available as a PDF file)
  4. a one-seminar class on association study at Simon-Fraser University: http://www.stat.sfu.ca/~jgraham/Teaching/S890_04/ (though not much material made available online)
  5. A few HELP pages from Manchester University: http://slack.ser.man.ac.uk/theory/association.html (the more general "genetic analysis" page is at http://slack.ser.man.ac.uk/)
  6. Terwilliger/Weiss' course on "logical reasoning in human genetics": http://www.genomeutwin.org/events/CR_GSTAT_19230904_HKI.html
History
Case-Control Analysis for One Two-Allele Marker
the "hydrogen atom" of association analysis

              aa    aA    AA
  cases       10   190   800
  controls     3   100   900
a few standard things to do:
  1. testing Hardy-Weinberg (H-W) equilibrium for both the control and the case group (especially the control group), for example with the program FINETTI: http://ihg.gsf.de/cgi-bin/hw/hwa1.pl
    [COMMENTS: the control group should follow H-W equilibrium; the case group may not, if the marker is indeed close to the disease gene. ]
  2. At least two ways to construct a 2-by-2 table for statistical tests:
    1. allele a versus allele A
                 allele "a"   allele "A"
      case          210         1790
      control       106         1900
      [COMMENTS: this 2-by-2 table effectively doubles the sample size (the number of alleles is twice the number of persons), which is "cheating" unless the two alleles within a person can be treated as independent, i.e. under H-W equilibrium. See: PD Sasieni (1997), "From genotypes to genes: doubling the sample size", Biometrics, 53:1253-1261. ]
    2. either combining aa and aA as one column (dominant model for allele "a"), or combining aA and AA as one column (recessive model for allele "a"):
                  aa+aA    AA                  aa   aA+AA
      case          200   800     case         10     990
      control       103   900     control       3    1000
      [COMMENTS: since we do not know which 2-by-2 table should be used, we should test both. This raises a "multiple-testing" issue. ]
  3. Pearson's chi-square test (if you are not familiar with this topic, see any of the following pages: http://www.psychstat.smsu.edu/introbook/sbk28m.htm,
    http://www.answers.com/topic/pearson-s-chi-square-test,
    http://faculty.vassar.edu/lowry/tab2x2.html,
    and many many more).

    p-value = 9.1 x 10^-10 (allele 2-by-2 table) (or 1.3 x 10^-9 if Yates' correction is used)
    p-value = 1.2 x 10^-9 (genotype, aa+aA versus AA, dominant) (or 1.8 x 10^-9 if Yates' correction is used)
    p-value = 0.051 (genotype, aa versus aA+AA, recessive) (or 0.094 if Yates' correction is used)
    min(p-value(rec), p-value(dom)) = 1.2 x 10^-9

    [COMMENTS: an alternative to Pearson's chi-square test is Fisher's exact test, which is more accurate for small sample sizes. ]

  4. Odds ratio and 95% confidence interval of the odds ratio: the "odds" is the ratio of the number of people who possess the "risk" (e.g. allele "a") to the number of people who do not. The "odds ratio" (OR) is the ratio of two such odds, one for the case group and one for the control group.

    for the three 2-by-2 tables:
    OR = (210/1790)/( 106/1900)= 2.10
    OR = (200/800)/( 103/900)= 2.18
    OR = (10/990)/( 3/1000)= 3.37

    [COMMENTS: the odds ratio is an approximation of the "relative risk", defined as Prob(affected|risk)/Prob(affected|no risk). The approximation is good when the disease is rare. Because of the case-control sample collection design, the relative risk itself cannot be estimated directly from the data. ]

    The OR value itself is not enough; we also need the range of plausible OR values (the confidence interval). An R/S-PLUS script is available for this calculation. The formula is due to: B Woolf (1955), "On estimating the relation between blood group and disease", Annals of Human Genetics, 19:251-253. For the above three 2-by-2 tables:

    95%CI: (1.65, 2.68)
    95%CI: (1.69, 2.82)
    95%CI: (0.92, 12.27)
    [COMMENTS: when one bound is smaller than 1 and the other is larger than 1, the interval contains OR = 1 and the result is not significant at the 0.05 level. ]
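
Steps 1 to 3 above can be sketched in a few lines of Python. This is a minimal illustration, not the FINETTI program or any particular package: the function names (chi2_pvalue_1df, hwe_test, pearson_2x2, two_by_two_tables) are made up for this example, the H-W check is assumed to be the usual chi-square goodness-of-fit test with 1 degree of freedom, and the p-value uses the identity P(chi-square_1 > x) = erfc(sqrt(x/2)).

```python
import math

def chi2_pvalue_1df(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

def hwe_test(n_aa, n_aA, n_AA):
    """Chi-square goodness-of-fit test of Hardy-Weinberg proportions (1 df)."""
    n = n_aa + n_aA + n_AA
    p = (2 * n_aa + n_aA) / (2.0 * n)  # estimated frequency of allele "a"
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_aA, n_AA)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, chi2_pvalue_1df(stat)

def pearson_2x2(a, b, c, d, yates=False):
    """Pearson chi-square test for the 2-by-2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(diff - n / 2.0, 0.0)  # Yates' continuity correction
    stat = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, chi2_pvalue_1df(stat)

def two_by_two_tables(case, control):
    """Build the allele table and the two collapsed genotype tables.

    case and control are (aa, aA, AA) genotype counts; each table is
    returned as (case_left, case_right, control_left, control_right).
    """
    (c0, c1, c2), (k0, k1, k2) = case, control
    return {
        "allele a vs A": (2 * c0 + c1, c1 + 2 * c2, 2 * k0 + k1, k1 + 2 * k2),
        "aa+aA vs AA": (c0 + c1, c2, k0 + k1, k2),
        "aa vs aA+AA": (c0, c1 + c2, k0, k1 + k2),
    }

cases, controls = (10, 190, 800), (3, 100, 900)
print("HWE, controls (stat, p):", hwe_test(*controls))
print("HWE, cases    (stat, p):", hwe_test(*cases))
for name, cells in two_by_two_tables(cases, controls).items():
    _, p_plain = pearson_2x2(*cells)
    _, p_yates = pearson_2x2(*cells, yates=True)
    print(f"{name}: p = {p_plain:.2g} (Yates: {p_yates:.2g})")
```

With the genotype counts of this lecture, neither group shows a significant departure from H-W proportions, and the Pearson p-values come out close to the ones listed in step 3. For small cell counts (such as the 10 and 3 in the aa column), an exact test would be the safer choice.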

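The odds ratio and Woolf's confidence interval from step 4 can be sketched the same way. The sketch below is not the R/S-PLUS script mentioned above; the function name odds_ratio_woolf and the table labels are invented for illustration, and it assumes the usual form of Woolf's method, in which log(OR) is treated as approximately normal with standard error sqrt(1/a + 1/b + 1/c + 1/d).

```python
import math

def odds_ratio_woolf(a, b, c, d, z=1.96):
    """Odds ratio and Woolf's log-based confidence interval.

    The table is [[a, b], [c, d]] with rows (case, control) and columns
    (risk, no risk); z = 1.96 gives an approximate 95% interval.
    """
    odds_ratio = (a / b) / (c / d)
    se_log_or = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# The three 2-by-2 tables of this lecture (labels are illustrative only).
tables = {
    "allele a vs A": (210, 1790, 106, 1900),
    "aa+aA vs AA": (200, 800, 103, 900),
    "aa vs aA+AA": (10, 990, 3, 1000),
}
for name, cells in tables.items():
    odds_ratio, lower, upper = odds_ratio_woolf(*cells)
    print(f"{name}: OR = {odds_ratio:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The formula breaks down when any cell count is zero (division by zero), and the normal approximation is rough when a cell is as small as the 3 controls with genotype aa, which is one reason the third interval is so wide.
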
From "Hydrogen" to "Oxygen": What to Expect from the Next Two Lectures?

1. If we know which gene is responsible for the disease, and if we can "see" the gene and know whether it contains a mutation or not, the association analysis is direct. However, the marker we examine may not have a one-to-one correspondence with the mutation status at the disease gene: there is some distance between the marker and the disease gene, and the linkage disequilibrium between them may not be complete. In addition, the marker is typed in terms of genotype, and the "phase" is usually unknown.

2. What we consider a homogeneous group of people from which the control samples are collected may not be homogeneous after all. The same holds for the population from which the case samples are collected. How do we deal with heterogeneity?