Consideration of an Alternative Question: Genetic Predictors
Before settling on a predictive model of diagnosis built from several types of patient information, we considered using the detailed biological data available to assess the importance of genes on chromosome 21 (Chr21) in Alzheimer’s Disease (AD). Chromosome 21 is known to be very important in the etiology of AD: for example, over half of those born with an extra copy of Chr21 (a condition known as Down’s Syndrome) will go on to develop AD [1]. While the link between Chr21 and AD is commonly attributed to the gene for the Amyloid Precursor Protein (APP) [2], many other genes on this chromosome have also been linked to AD [3].
Of the wealth of genetic data that ADNI publishes, we drew primarily on the microarray gene expression profile dataset, which used ~50,000 genetic probes to assess the activity of genes across the genome. The outcome for each patient in the gene profile dataset was determined using the ADNIMerge dataset. To identify only chromosome 21 genes, the Affymetrix gene annotation dataset was used to annotate the gene expression data set with chromosomal location of the target gene for every probe. The combination of these three datasets created the possibility of building a model based on the genes of any chromosome to predict any clinical outcome. The preliminary models (discussed below) suggested little promise of interesting results to be derived from considering gene expression data in isolation, so we ultimately decided to focus on a more general predictive model involving more feature types instead.
To determine the chromosomal location and biological role of each gene, we used the gene annotation database provided by Affymetrix. Using this database, we added columns to the gene expression profile dataset giving the chromosomal location and gene name for every probe. From this, we can isolate genes based on their chromosomal location.
Chromosome 21 genes present in gene expression profile dataset: 609
SubjectID | LocusLink | Symbol | 116_S_1249 | 037_S_4410 | 006_S_4153 | 116_S_1232 | 099_S_4205 | 007_S_4467 | 128_S_0205 | 003_S_2374 | ... | 014_S_4668 | 130_S_0289 | 141_S_4456 | 009_S_2381 | 053_S_4557 | 073_S_4300 | 041_S_4014 | 007_S_0101 | Biological Name | Chromosome |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Gene_PSID | |||||||||||||||||||||
Visit | NaN | NaN | m48 | v03 | v03 | m48 | v03 | v03 | v06 | bl | ... | v03 | m60 | v03 | bl | v03 | v03 | v03 | v06 | NaN | NaN |
Phase | NaN | NaN | ADNIGO | ADNI2 | ADNI2.1 | ADNIGO.1 | ADNI2.2 | ADNI2.3 | ADNI2.4 | ADNIGO.2 | ... | ADNI2.443 | ADNIGO.293 | ADNI2.444 | ADNIGO.294 | ADNI2.445 | ADNI2.446 | ADNI2.447 | ADNI2.448 | Unnamed: 747 | NaN |
260/280 | NaN | NaN | 2.05 | 2.07 | 2.04 | 2.03 | 2.01 | 2.05 | 1.95 | 1.99 | ... | 2.05 | 1.98 | 2.09 | 1.87 | 2.03 | 2.11 | 1.94 | 2.06 | NaN | NaN |
260/230 | NaN | NaN | 0.55 | 1.54 | 2.1 | 1.52 | 1.6 | 1.91 | 1.47 | 2.07 | ... | 2.05 | 1.65 | 1.56 | 1.45 | 1.33 | 0.27 | 1.72 | 1.35 | NaN | NaN |
RIN | NaN | NaN | 7.7 | 7.6 | 7.2 | 6.8 | 7.9 | 7 | 7.9 | 7.2 | ... | 6.5 | 6.3 | 6.4 | 6.6 | 6.8 | 6.2 | 5.8 | 6.7 | NaN | NaN |
Affy Plate | NaN | NaN | 7 | 3 | 6 | 7 | 9 | 4 | 3 | 8 | ... | 6 | 9 | 3 | 8 | 5 | 3 | 1 | 4 | NaN | NaN |
YearofCollection | NaN | NaN | 2011 | 2012 | 2011 | 2011 | 2011 | 2012 | 2011 | 2011 | ... | 2012 | 2011 | 2012 | 2011 | 2012 | 2011 | 2011 | 2012 | NaN | NaN |
ProbeSet | LocusLink | Symbol | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
11715130_s_at | LOC337967 | KRTAP6-2 | 2.424 | 2.623 | 2.501 | 3.103 | 2.567 | 2.992 | 2.249 | 2.63 | ... | 3.074 | 2.763 | 2.89 | 3.155 | 2.526 | 2.452 | 2.822 | 2.651 | [KRTAP6-2] keratin associated protein 6-2 | 21 |
11715131_s_at | LOC337975 | KRTAP20-1 | 1.826 | 2.306 | 2.735 | 2.777 | 2.897 | 2.631 | 2.613 | 2.481 | ... | 2.46 | 2.374 | 2.074 | 3.121 | 2.725 | 2.21 | 2.828 | 2.444 | [KRTAP20-1] keratin associated protein 20-1 | 21 |
11715144_s_at | LOC337974 | KRTAP19-7 | 3.256 | 3.404 | 4.112 | 3.279 | 3.67 | 3.279 | 3.257 | 3.86 | ... | 3.856 | 3.64 | 3.642 | 3.772 | 4.187 | 3.579 | 4.542 | 3.92 | [KRTAP19-7] keratin associated protein 19-7 | 21 |
11715145_s_at | LOC337976 | KRTAP20-2 | 3.529 | 4.029 | 4.254 | 4.094 | 3.629 | 3.632 | 3.614 | 4.021 | ... | 4.026 | 3.808 | 3.966 | 4.134 | 3.881 | 3.833 | 4.112 | 3.945 | [KRTAP20-2] keratin associated protein 20-2 | 21 |
11715156_s_at | LOC337966 | KRTAP6-1 | 3.855 | 3.9 | 4.124 | 4.428 | 3.947 | 4.371 | 4.14 | 4.129 | ... | 4.783 | 3.961 | 4.035 | 4.428 | 3.972 | 4.208 | 4.622 | 4.147 | [KRTAP6-1] keratin associated protein 6-1 | 21 |
11715157_s_at | LOC337969 | KRTAP19-2 | 2 | 2.162 | 2.135 | 2.144 | 2.144 | 2.147 | 1.938 | 2.27 | ... | 2.19 | 2.045 | 2.545 | 2.222 | 2.332 | 1.998 | 2.133 | 2.238 | [KRTAP19-2] keratin associated protein 19-2 | 21 |
11715158_s_at | LOC337971 | KRTAP19-4 | 2.682 | 2.993 | 2.778 | 2.904 | 2.714 | 2.672 | 2.837 | 2.578 | ... | 2.617 | 2.877 | 2.944 | 2.612 | 2.729 | 2.7 | 2.837 | 2.582 | [KRTAP19-4] Keratin associated protein 19-4 | 21 |
15 rows × 748 columns
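The annotation step described above can be sketched roughly as follows. This is a minimal illustration on toy data, not the code used in the original analysis: the helper `add_chromosome`, the probe IDs `probe_app` and `probe_x`, the subject columns and all expression values are hypothetical, and the parsing assumes the `chr21q22.3`-style layout of the `Chromosomal Location` column shown in the annotation tables in this section.

```python
import pandas as pd

# Toy stand-in for the expression table: one row per probe, one column per subject.
expression = pd.DataFrame(
    {
        "Symbol": ["KRTAP6-2", "APP", "GENE_X"],
        "subject_a": [2.424, 8.1, 5.0],
        "subject_b": [2.623, 7.9, 5.2],
    },
    index=pd.Index(["11715130_s_at", "probe_app", "probe_x"], name="ProbeSet"),
)

# Toy stand-in for the annotation table, keyed by probe set ID.
annotation = pd.DataFrame(
    {
        "Probe Set ID": ["11715130_s_at", "probe_app", "probe_x"],
        "Chromosomal Location": ["chr21q22.11", "chr21q21.3", "chr1p36.33"],
    }
)


def add_chromosome(expr, annot):
    """Attach a 'Chromosome' column parsed from 'Chromosomal Location'."""
    annot = annot.set_index("Probe Set ID")
    # 'chr21q22.3' -> '21'; assumes a 'chr<name><p|q>...' layout.
    chrom = annot["Chromosomal Location"].str.extract(r"^chr(\d+|X|Y)", expand=False)
    # Join on the probe ID index so every probe row gains its chromosome.
    return expr.join(chrom.rename("Chromosome"))


annotated = add_chromosome(expression, annotation)
chr21 = annotated[annotated["Chromosome"] == "21"]
print(f"Chromosome 21 probes in toy data: {len(chr21)}")
```

Joining on the probe ID rather than the gene symbol keeps probes distinct when several of them target the same gene, as is the case for APP here.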
From this dataset, we find 13 features corresponding to probes of the APP gene, one of the genes most strongly implicated in Alzheimer’s Disease.
To analyse the role of different genes on chromosome 21 in the etiology of Alzheimer’s Disease, we first needed a baseline, so we fitted a simple logistic regression using only these 13 features as the X data set.
To ensure that class imbalance in the data was not an issue, we applied the Synthetic Minority Over-sampling Technique (SMOTE) from the imblearn package.
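To make the idea behind SMOTE concrete, here is a hand-rolled NumPy illustration of its core step: each synthetic sample is placed at a random point on the line between a minority-class sample and one of its nearest minority-class neighbours. This is a simplified sketch and not imblearn's implementation; the function name `smote_like_oversample` and its parameters are hypothetical, and the input points are toy values.

```python
import numpy as np


def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)

    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)   # sample to start from
    pick = rng.integers(0, k, size=n_new)   # which neighbour to move towards
    gap = rng.random(n_new)[:, None]        # interpolation fraction in [0, 1)
    partner = X_min[neighbours[base, pick]]
    return X_min[base] + gap * (partner - X_min[base])


X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_oversample(X_minority, n_new=6)
print(synthetic.shape)
```

In practice, imblearn exposes this technique as the `SMOTE` class in `imblearn.over_sampling`, used through `fit_resample(X, y)`. Oversampling is usually applied to the training split only, so that synthetic points do not leak into the evaluation data; whether the analysis here did so is not stated.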
The cross-validation and test scores for this baseline model are:
Cross-validation score 0.5798706240487063
Test score 0.5178571428571429
Both scores are only marginally better than random guessing, so the APP gene expression profile appears to be a very poor predictor of AD.
Next, we built a similarly simple model using all genes on chromosome 21 to see whether any of them stood out as interesting.
The results of this are:
Cross-validation score 0.8115677321156773
Test score 0.6965041721563461
These results are slightly more promising, although the gap between the cross-validation and test scores suggests some overfitting. They warrant an investigation into the details of the model: which genes have the largest-magnitude coefficients?
Biggest positive coefficient:
Index | Probe Set ID | UniGene ID | Alignments | Gene Title | Gene Symbol | Chromosomal Location | Entrez Gene | SwissProt | RefSeq Protein ID | RefSeq Transcript ID | Gene Ontology Biological Process | Gene Ontology Cellular Component | Gene Ontology Molecular Function | Pathway | InterPro | Trans Membrane | Chromosome |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
133 | 11715233_s_at | Hs.381214 | chr21:47581062-47604373 (-) // 95.94 // q22.3 | spermatogenesis and centriole associated 1-like | SPATC1L | chr21q22.3 | 84221 | Q9H0A9 | NP_001136326 /// NP_115637 /// XP_005261245 //... | NM_001142854 /// NM_032261 /// XM_005261188 //... | --- | --- | 0005515 // protein binding // inferred from ph... | --- | IPR029384 // Speriolin, C-terminal // 1.0E-75 ... | --- | 21 |
Biggest negative coefficient:
Index | Probe Set ID | UniGene ID | Alignments | Gene Title | Gene Symbol | Chromosomal Location | Entrez Gene | SwissProt | RefSeq Protein ID | RefSeq Transcript ID | Gene Ontology Biological Process | Gene Ontology Cellular Component | Gene Ontology Molecular Function | Pathway | InterPro | Trans Membrane | Chromosome |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8325 | 11723425_at | Hs.529400 | chr21:34697208-34732236 (+) // 82.3 // q22.11 | interferon (alpha, beta and omega) receptor 1 | IFNAR1 | chr21q22.11 | 3454 | P17181 | NP_000620 /// XP_005261021 /// XP_011527854 | NM_000629 /// XM_005260964 /// XM_011529552 | 0007166 // cell surface receptor signaling pat... | 0005622 // intracellular // traceable author s... | 0004904 // interferon receptor activity // inf... | --- | IPR003961 // Fibronectin type III // 2.1E-35 /... | --- | 21 |
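Picking out the extreme coefficients, as in the two lookups above, can be sketched as follows. The probe names and coefficient values are hypothetical; with a fitted scikit-learn `LogisticRegression` the values would come from `model.coef_[0]`, paired with the feature (probe) names in the same order.

```python
import pandas as pd

# Hypothetical coefficients, one per probe, indexed by probe name.
coefficients = pd.Series(
    {"probe_a": 0.42, "probe_b": -0.87, "probe_c": 0.05, "probe_d": 0.61}
)

largest_positive = coefficients.idxmax()   # probe with the biggest positive weight
largest_negative = coefficients.idxmin()   # probe with the biggest negative weight

# Full ranking by absolute size, useful for looking beyond the top two.
ranked_by_magnitude = coefficients.abs().sort_values(ascending=False)
print(largest_positive, largest_negative)
```

Coefficient magnitudes are only comparable across features when the features are on a similar scale, so a ranking like this is most meaningful after standardising the expression values.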
Here, the gene with the largest positive coefficient in this model (SPATC1L) has no identified link with cognitive function, autophagy, protein synthesis, calcium signalling, the immune system, inflammation or any other process linked with neurodegenerative disease, and it is only moderately expressed in the brain. Moreover, the gene with the strongest negative coefficient (IFNAR1) has previously been positively linked with amyloidogenesis in mice, contrary to what our model might imply.
This suggests that there may be limited immediate benefit to using gene expression profile data. To examine this further, we plotted the gene expression distribution for all 13 APP probes and for the probe that our model found to be the strongest predictor. We see no clear difference in distribution between AD and non-AD subjects, which again implies that gene expression data may not be as useful as first thought.
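A numerical companion to such plots is to compare summary statistics of a probe's expression between the two groups. The sketch below uses toy values and hypothetical column names (`expression`, `diagnosis`); it is not the code behind the plots described above.

```python
import pandas as pd

# Toy data: one probe's expression per subject plus a binary diagnosis label.
df = pd.DataFrame(
    {
        "expression": [7.9, 8.1, 8.0, 8.2, 7.8, 8.1],
        "diagnosis": ["AD", "AD", "AD", "non-AD", "non-AD", "non-AD"],
    }
)

# Per-group location and spread of the expression values.
summary = df.groupby("diagnosis")["expression"].agg(["mean", "std", "median"])

# Absolute difference between the two group means.
mean_gap = abs(summary.loc["AD", "mean"] - summary.loc["non-AD", "mean"])
print(summary)
```

A difference in means that is small relative to the within-group standard deviation is consistent with overlapping distributions of the kind described above.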
While researching the details of the gene expression dataset, we found that the samples from which the data are derived were taken from blood, not the central nervous system. This means that the gene expression profiles are not indicative of the cellular environment in the brain, as gene expression profiles vary hugely between tissues and the central nervous system is separated from the rest of the body by a near-impervious blood-brain barrier. As such, even though other models could be tested and different chromosomes examined, we decided that the chance of finding valuable results was too low, and the group focussed on other aims.
[1] https://www.nia.nih.gov/health/alzheimers-disease-people-down-syndrome
[2] https://www.alz.org/alzheimers-dementia/what-is-dementia/types-of-dementia/down-syndrome
[3] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4019841/
https://www.ncbi.nlm.nih.gov/pubmed/18199027
http://www.bbc.com/future/story/20181022-there-is-mounting-evidence-that-herpes-leads-to-alzheimers