Content and Source of Data

Our data come from a study of coronary artery disease (CAD) comprising 1,401 individuals with genotype information across 861,473 single nucleotide polymorphisms (SNPs) [2]. The individuals were surveyed between July 1998 and March 2003 as part of a case-control study of people of European ancestry, in which case status was based on severe angiographic CAD. We provide a pre-processed dataset in the form of .csv files.

Below is a list of the variables for you to analyse (refer to the appendix for explanations of the biological concepts).

High-density lipoprotein (HDL)
Low-density lipoprotein (LDL)
Triglycerides (TG)
Coronary artery disease (CAD)
\(6 \times 10^{5}+\) different SNPs


The aim of our study is to determine which single nucleotide polymorphisms (SNPs) are significantly associated with standard lipid measures and with the occurrence of coronary artery disease (CAD), via a well-known method in computational biology called a “genome-wide association study” (GWAS). In particular, we present and use the methods described in a well-known paper by Reed et al. [2].

We will judge your work by the quality of the final report: the interpretability of your model and the clarity of your explanations are important.

You should output your \(p\)-values and any other findings to a csv file, zip it together with all the code used, and upload the archive to the designated folder.
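As a minimal sketch of the output step, the function below writes one row per SNP to a csv file. The function name `write_pvalues`, the column names `snp` and `p_value`, and the six-significant-digit formatting are illustrative assumptions, not a required submission format.

```python
import csv

def write_pvalues(rows, path):
    """Write (snp_id, p_value) pairs to a csv file with a header row.

    `rows` is any iterable of (str, float) pairs; the column names are
    hypothetical and can be changed to suit the report.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["snp", "p_value"])
        for snp, p in rows:
            # Six significant digits keep very small p-values readable.
            writer.writerow([snp, f"{p:.6g}"])
```

Calling `write_pvalues(results, "pvalues.csv")` with a list of `(snp_id, p_value)` pairs would produce a file that can be opened in any spreadsheet or statistics package.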

Overview (An example)

We build a linear model for individual \(i\) and SNP \(j\): \[ HDL_{i} = \alpha_{j} + x_{i,SNP_{j}}\beta_{j,1} + x_{i,LDL}\beta_{j,2} + x_{i,CAD}\beta_{j,3} + x_{i,SEX}\beta_{j,4} + x_{i,TG}\beta_{j,5} + \sum_{k=1}^{9}PC_{i,k}\beta_{j,5+k} + \epsilon_{i}, \] with \(\epsilon_{i}\stackrel{i.i.d.}{\sim} F\) for some probability distribution \(F\). With HDL as the response, and fitting one SNP together with the 9 principal components at a time, we obtain the results summarised in the Manhattan plot below, taking \(F\) to be Gaussian.
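The per-SNP fit can be sketched as an ordinary least-squares regression followed by a test on the SNP coefficient. The sketch below is written in Python with only the standard library and runs on simulated data, not the study data. The function name `ols_wald`, the simulated effect sizes, and the single stand-in covariate are assumptions made for illustration. The p-value uses a normal approximation to the \(t\) distribution, which is reasonable only when there are many more individuals than model terms.

```python
import math
import random

def ols_wald(X, y, col):
    """Least-squares fit of y on the columns of X.

    Returns (estimate, two-sided p-value) for column `col`, using a normal
    approximation to the t distribution (assumes rows >> columns).
    """
    n, p = len(X), len(X[0])
    # Augmented system [X'X | X'y | I], reduced by Gauss-Jordan elimination.
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(p)]
         + [sum(X[r][i] * y[r] for r in range(n))]
         + [1.0 if i == j else 0.0 for j in range(p)]
         for i in range(p)]
    for i in range(p):
        piv = max(range(i, p), key=lambda r: abs(A[r][i]))  # partial pivoting
        A[i], A[piv] = A[piv], A[i]
        d = A[i][i]
        A[i] = [v / d for v in A[i]]
        for r in range(p):
            if r != i:
                f = A[r][i]
                A[r] = [vr - f * vi for vr, vi in zip(A[r], A[i])]
    beta = [A[i][p] for i in range(p)]       # least-squares coefficients
    inv_diag = A[col][p + 1 + col]           # (X'X)^{-1}[col, col]
    resid = [y[r] - sum(X[r][j] * beta[j] for j in range(p)) for r in range(n)]
    sigma2 = sum(e * e for e in resid) / (n - p)   # residual variance
    z = beta[col] / math.sqrt(sigma2 * inv_diag)
    return beta[col], math.erfc(abs(z) / math.sqrt(2))

# Simulated example (hypothetical effect sizes, not the study data).
random.seed(0)
n = 1000
geno = [sum(random.random() < 0.3 for _ in range(2)) for _ in range(n)]  # 0/1/2 allele counts
cov = [random.gauss(0, 1) for _ in range(n)]                             # one stand-in covariate
hdl = [1.0 + 0.5 * g + 0.3 * c + random.gauss(0, 1) for g, c in zip(geno, cov)]
X = [[1.0, float(g), c] for g, c in zip(geno, cov)]  # intercept, SNP, covariate
beta_snp, p_snp = ols_wald(X, hdl, col=1)
print(beta_snp, p_snp)
```

For the full model, each row of `X` would also carry columns for LDL, CAD, sex, TG and the nine principal components, and the fit would be repeated once per SNP.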

We see that plenty of SNPs are statistically interesting. For example, the SNP rs1532625 surpasses the candidate threshold. This finding gives lab biologists a concrete lead for further laboratory work on this particular chromosome and SNP.

One way to assess model fit is to draw a Q-Q plot of the observed against the expected \(\chi^{2}\) statistics of our models. The left graphs show the fit when we include the confounders (the 9 principal components), and the right graphs show the fit when we do not. For the HDL model, the right tails of both plots are heavy: they deviate from the diagonal line. As suggested by Reed et al. [2], since the points lie mostly along the diagonal line, we probably do not have systematic bias. Furthermore, the heavy tails suggest some degree of association.
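The points of such a Q-Q plot can be computed from the per-SNP \(p\)-values alone. The sketch below assumes that each test corresponds to a \(\chi^{2}\) statistic with 1 degree of freedom and uses the plotting positions \((i - 0.5)/m\). The function names `chi2_from_p` and `qq_points` are illustrative, and the plotting itself is left to whichever library you prefer.

```python
from statistics import NormalDist

def chi2_from_p(p):
    """1-df chi-square statistic whose upper-tail probability is p (0 < p <= 1)."""
    # If Z is standard normal, Z**2 is chi-square with 1 df,
    # so P(Z**2 > x) = p gives x = (Phi^{-1}(p / 2))**2.
    return NormalDist().inv_cdf(p / 2) ** 2

def qq_points(pvalues):
    """Return (expected, observed) chi-square pairs, sorted in ascending order."""
    m = len(pvalues)
    observed = sorted(chi2_from_p(p) for p in pvalues)
    # The i-th smallest of m null statistics sits near the (i - 0.5)/m quantile.
    expected = [chi2_from_p(1 - (i - 0.5) / m) for i in range(1, m + 1)]
    return list(zip(expected, observed))
```

Points that follow the diagonal `expected == observed` indicate agreement with the null distribution, while points rising above it in the right tail correspond to the heavy tails discussed above.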

Another way is to use the \(\lambda\)-statistic (the genomic inflation factor), where a value close to 1 suggests that possible population substructure has been adequately adjusted for [2].
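One common definition of \(\lambda\) is the median of the observed 1-degree-of-freedom \(\chi^{2}\) statistics divided by the median expected under the null (about 0.455). The sketch below uses that median-ratio definition as an assumption. The estimator used in [2] may differ, and the function name `genomic_inflation` is illustrative.

```python
from statistics import NormalDist, median

def genomic_inflation(pvalues):
    """Median observed 1-df chi-square statistic over its null median."""
    nd = NormalDist()
    # Convert each p-value (0 < p <= 1) back to a 1-df chi-square statistic.
    observed = [nd.inv_cdf(p / 2) ** 2 for p in pvalues]
    # Median of a 1-df chi-square distribution, about 0.455.
    expected_median = nd.inv_cdf(0.25) ** 2
    return median(observed) / expected_median
```

Values well above 1 indicate that the test statistics are inflated relative to the null, which is what unadjusted population substructure tends to produce.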

The Project

Main Task: Study which SNPs are likely to cause changes in measures such as triglycerides, low-density lipoprotein (LDL) cholesterol and high-density lipoprotein (HDL) cholesterol, and which SNPs seem to be most strongly linked to cardiovascular disease. In particular, use linear and logistic regression. Be careful with how you interpret the \(p\)-values (apply a Bonferroni correction for multiple testing).
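The Bonferroni correction divides the significance level by the number of tests, or equivalently multiplies each \(p\)-value by the number of tests and caps it at 1. With roughly \(6 \times 10^{5}\) SNPs and a level of 0.05, the per-SNP threshold is about \(8.3 \times 10^{-8}\). A minimal sketch, in which the function name `bonferroni` and the default level of 0.05 are assumptions:

```python
def bonferroni(pvalues, alpha=0.05):
    """Return (per-test threshold, adjusted p-values, significance flags)."""
    m = len(pvalues)
    threshold = alpha / m                          # per-test significance level
    adjusted = [min(1.0, p * m) for p in pvalues]  # adjusted p-values, capped at 1
    significant = [a <= alpha for a in adjusted]   # same as comparing p with alpha / m
    return threshold, adjusted, significant
```

Here `m` should be the number of SNPs actually tested for a given response, so the threshold is recomputed if SNPs are filtered out beforehand.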

Bonus: Other methods are also possible and it is up to you to explore these options.

Appendix: Biological Background

We define a few key biological concepts:

  1. Low-density lipoprotein (LDL): Cholesterol travels through the blood on two types of lipoprotein, LDL and HDL. LDL makes up most of the body’s cholesterol, and high levels raise the risk of heart disease and stroke.
  2. High-density lipoprotein (HDL): HDL absorbs cholesterol and carries it back to the liver. High levels can have the opposite effect to high levels of LDL.
  3. Triglycerides: Triglycerides are a type of lipid found in human blood. They are usually stored in the fat cells. Triglycerides store unused calories and provide the body with energy, whereas cholesterol is used to build cells and some hormones. High triglyceride levels are often associated with other conditions that increase the risk of heart disease.
  4. Coronary artery disease (CAD): This is a condition in which the coronary arteries, the major blood vessels supplying the heart, become damaged. This can lead to heart attacks or strokes.
  5. Angiography: Coronary angiography is an imaging method that uses X-rays to view the blood vessels of the heart inside a person’s body.


The demonstrations above were done in R. However, any scientific computing language will also work.


[1] T. Thornton, “Identity by Descent, the Kinship Coefficient, and the Coefficient of Fraternity,” Summer Institute in Statistical Genetics 2013, 2013.

[2] E. Reed, S. Nunez, D. Kulp, J. Qian, M. P. Reilly, and A. S. Foulkes, “A guide to genome-wide association analysis and post-analytic interrogation,” Statistics in Medicine, 2015.