Our data come from a study of coronary artery disease (CAD) containing 1,401 individuals with genotype information across 861,473 single nucleotide polymorphisms (SNPs) [2]. The individuals were surveyed between July 1998 and March 2003, and case-control status was based on severe angiographic CAD in individuals of European ancestry. We provide a pre-processed dataset in the form of .csv files for:
Below is a list of the covariates for you to analyse (refer to the appendix for explanations of the biological concepts).
Features |
---|
High-density lipoprotein (HDL) |
Low-density lipoprotein (LDL) |
Triglycerides (TG) |
Coronary artery disease (CAD) |
Sex |
Age |
\(6 \times 10^{5}+\) different SNPs |
The aim of our study is to determine which single nucleotide polymorphisms (SNPs) across the chromosomes are significantly associated with standard lipid measures and with the occurrence of coronary artery disease (CAD), via a well-known method in computational biology called a “Genome-wide Association Study” (GWAS). In particular, we present and use the methods described in the well-known paper by Reed et al. [2].
We will judge you by the final quality of your report: interpretability of your model and the clarity of your explanation in the final report are important.
You should output your \(p\)-values or any findings into a csv file, zip it together with all the code used, and upload the archive to the designated folder.
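As a minimal sketch of the output step, the Python snippet below writes one row per SNP to a CSV stream. The column names (`snp`, `chromosome`, `p_value`), the function name `write_pvalues` and the example rows are assumptions made for illustration; they are not a required format and not results from the study data.

```python
import csv
import io


def write_pvalues(rows, handle):
    """Write (snp, chromosome, p_value) rows as CSV, preceded by a header line."""
    writer = csv.writer(handle)
    writer.writerow(["snp", "chromosome", "p_value"])
    for snp, chromosome, p_value in rows:
        # Scientific notation keeps very small p-values readable.
        writer.writerow([snp, chromosome, f"{p_value:.6e}"])


# Illustrative placeholder values only, not findings from the dataset.
example_rows = [("rs_example_1", 1, 3.2e-9), ("rs_example_2", 16, 0.047)]
buffer = io.StringIO()
write_pvalues(example_rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")` in place of the in-memory buffer.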
For each SNP \(j\), we build a linear model \[ HDL_{i} = \alpha + x_{SNP_{j},i}\beta_{1} + x_{LDL,i}\beta_{2} + x_{CAD,i}\beta_{3} + x_{SEX,i}\beta_{4} + x_{TG,i}\beta_{5} + \sum_{k=1}^{9}PC_{k,i}\beta_{5+k} + \epsilon_{i}, \] where \(i\) indexes individuals, the coefficients are re-estimated for every SNP, and \(\epsilon_{i}\stackrel{i.i.d.}{\sim} F\) for some probability distribution \(F\). With HDL as the response, and adding one SNP and the 9 principal components each time, we obtain the results below, summarised by a Manhattan plot, with \(F\) being Gaussian.
We see that plenty of SNPs are statistically interesting. For example, the SNP rs1532625 surpasses the candidate threshold. This finding gives lab biologists important guidance for further lab work on this particular chromosome and SNP.
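The single-SNP linear model can be sketched in a few lines of Python. This is a minimal illustration on simulated data, not the code used to produce the results above: the helper names (`invert`, `snp_association`), the simulated effect sizes and the two stand-in covariates are all assumptions, and the \(p\)-value uses a normal approximation to the \(t\)-distribution, which is reasonable only because the sample size is large. In practice a dedicated regression routine would replace the hand-written least-squares step.

```python
import math
import random


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def snp_association(y, snp, covariates):
    """Fit y ~ 1 + snp + covariates by least squares; return (beta_snp, z, p) for the SNP."""
    X = [[1.0, s] + list(c) for s, c in zip(snp, covariates)]
    n, k = len(X), len(X[0])
    xtx = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(k)]
    xtx_inv = invert(xtx)
    beta = [sum(xtx_inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    residuals = [yi - sum(b * x for b, x in zip(beta, row)) for row, yi in zip(X, y)]
    sigma2 = sum(r * r for r in residuals) / (n - k)
    z = beta[1] / math.sqrt(sigma2 * xtx_inv[1][1])
    # Two-sided p-value under a normal approximation (assumption: large n).
    p = math.erfc(abs(z) / math.sqrt(2))
    return beta[1], z, p


# Simulated data (assumption): genotype coded 0/1/2, one binary covariate
# standing in for sex and one Gaussian covariate standing in for a principal component.
random.seed(1)
n = 500
snp = [sum(random.random() < 0.3 for _ in range(2)) for _ in range(n)]
covariates = [(float(random.random() < 0.5), random.gauss(0, 1)) for _ in range(n)]
hdl = [1.0 + 0.8 * s + 0.3 * c[0] + random.gauss(0, 1) for s, c in zip(snp, covariates)]

beta_snp, z, p = snp_association(hdl, snp, covariates)
print(f"beta_snp={beta_snp:.3f}, z={z:.2f}, p={p:.3g}")
```

In a full analysis this function would be called once per SNP, with the same covariate columns each time, and the resulting \(p\)-values collected for the Manhattan plot.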
One way to assess model fit is a Q-Q plot of the observed versus expected \(\chi^{2}\) statistics of our models. The left graphs show the models that include the confounders (the 9 principal components), and the right graphs show the models that do not. For the HDL model, the right tails of both plots are heavy: they deviate from the diagonal lines. As suggested by Reed et al. [2], since the bulk of the points lie along the diagonal line, we probably do not have systematic bias, and the heavy tails suggest some degree of association.
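The coordinates of such a Q-Q plot can be computed as in the sketch below. It assumes the test statistics follow a \(\chi^{2}\) distribution with 1 degree of freedom under the null, and uses the fact that the square of a standard normal variable has that distribution. The function names and the simulated null statistics are illustrative assumptions; plotting is left to whichever library you prefer.

```python
import random
from statistics import NormalDist


def chi2_1df_quantile(u):
    """Quantile of the 1-df chi-square distribution at probability u, 0 < u < 1.

    If Z ~ N(0, 1) then Z**2 ~ chi-square(1), so the quantile is the
    square of the standard normal quantile at (1 + u) / 2.
    """
    return NormalDist().inv_cdf((1.0 + u) / 2.0) ** 2


def qq_points(observed_chi2):
    """Pair each sorted observed statistic with its expected null quantile."""
    observed = sorted(observed_chi2)
    m = len(observed)
    expected = [chi2_1df_quantile((i + 0.5) / m) for i in range(m)]
    return list(zip(expected, observed))


# Simulated null statistics (assumption): points should hug the diagonal.
random.seed(0)
null_chi2 = [random.gauss(0, 1) ** 2 for _ in range(2000)]
points = qq_points(null_chi2)
print(points[:3], points[-3:])
```

Plotting the second coordinate against the first gives the Q-Q plot; systematic departure from the diagonal across the whole range, rather than only in the upper tail, would point to inflation.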
Another way is to use the \(\lambda\)-statistic, where a value close to 1 suggests that possible population substructure has been adequately adjusted for [2].
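The \(\lambda\)-statistic is commonly computed as the median of the observed \(\chi^{2}\) statistics divided by the median of the 1-degree-of-freedom \(\chi^{2}\) distribution (about 0.455). The sketch below assumes you start from two-sided \(p\)-values and converts them back to \(\chi^{2}\) statistics; the function names and the synthetic \(p\)-values are assumptions for illustration.

```python
from statistics import NormalDist, median

# Median of the 1-df chi-square distribution, roughly 0.455.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def chi2_from_p(p):
    """1-df chi-square statistic matching a two-sided p-value, 0 < p <= 1.

    Uses the lower tail, inv_cdf(p / 2), to stay accurate for tiny p-values.
    """
    return NormalDist().inv_cdf(p / 2.0) ** 2


def genomic_inflation(p_values):
    """Ratio of the median observed chi-square to the null median."""
    return median(chi2_from_p(p) for p in p_values) / CHI2_1DF_MEDIAN


# Synthetic p-values (assumption): an evenly spaced grid mimics the null,
# and squaring each value makes the p-values systematically too small.
m = 1001
uniform_p = [(i + 0.5) / m for i in range(m)]
inflated_p = [p ** 2 for p in uniform_p]

lambda_null = genomic_inflation(uniform_p)
lambda_inflated = genomic_inflation(inflated_p)
print(lambda_null, lambda_inflated)
```

Comparing the value obtained with and without the principal components in the model shows how much of the inflation the adjustment removes.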
Main Task: Study which SNPs are likely to affect measures such as triglycerides, low-density lipoprotein (LDL) cholesterol and high-density lipoprotein (HDL) cholesterol, and which SNPs seem to be most strongly linked to cardiovascular disease. In particular, use linear and logistic regressions. Be careful with how you interpret the \(p\)-values (apply a Bonferroni correction for multiple testing).
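The Bonferroni correction divides the significance level by the number of tests performed. A minimal sketch follows, assuming a family-wise level of 0.05 and using the 861,473 SNPs quoted earlier as the number of tests; the count you should use is the number of SNPs you actually test. The function names and the SNP identifiers in the example are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def significant_snps(p_values, n_tests, alpha=0.05):
    """Return the SNPs whose p-value falls below the Bonferroni threshold."""
    threshold = bonferroni_threshold(alpha, n_tests)
    return {snp: p for snp, p in p_values.items() if p < threshold}


# Assumption: one test per SNP, with the SNP count taken from the data description.
n_snps = 861_473
threshold = bonferroni_threshold(0.05, n_snps)
print(f"Bonferroni threshold: {threshold:.3g}")

# Hypothetical p-values for illustration only.
example = {"rs_example_1": 2.0e-9, "rs_example_2": 1.0e-6, "rs_example_3": 0.03}
hits = significant_snps(example, n_snps)
print(hits)
```

Note that a \(p\)-value of 0.03, which would pass an unadjusted 0.05 cut-off, is far from significant once the number of tests is taken into account.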
Bonus: Other methods are also possible and it is up to you to explore these options.
We define a couple of key biological concepts:
The demonstrations above were done in R. However, any scientific computing language would also work.
[1] T. Thornton, “Identity by Descent, the Kinship Coefficient, and the Coefficient of Fraternity,” Summer Institute in Statistical Genetics 2013, 2013.
[2] E. Reed, S. Nunez, D. Kulp, J. Qian, M. P. Reilly, and A. S. Foulkes, “A guide to genome-wide association analysis and post-analytic interrogation,” Statistics in Medicine, 2015.