Other Considerations

Gene Expression Profiles

Consideration of an Alternative Question: Genetic Predictors

Before settling on a predictive model of diagnosis built from several types of patient information, we considered using the detailed biological data available to examine the importance of genes on chromosome 21 in Alzheimer’s Disease (AD). Chromosome 21 is known to be very important in the etiology of AD: for example, over half of those born with an extra copy of Chr21 (a condition known as Down’s Syndrome) will go on to develop Alzheimer’s Disease [1]. While the link between Chr21 and AD is commonly attributed to the gene for the Amyloid Precursor Protein (APP) [2], many other genes on this chromosome have also been linked to Alzheimer’s Disease [3].

Of the wealth of genetic data that ADNI publishes, we drew primarily on the microarray gene expression profile dataset, which used ~50,000 genetic probes to assess the activity of genes across the genome. The outcome for each patient in the gene expression dataset was determined using the ADNIMerge dataset. To isolate chromosome 21 genes, we used the Affymetrix gene annotation dataset to annotate the gene expression dataset with the chromosomal location of the target gene for every probe. Together, these three datasets make it possible to build a model from the genes of any chromosome to predict any clinical outcome. The preliminary models (discussed below) showed little promise when gene expression data were considered in isolation, so we ultimately decided to focus on a more general predictive model involving more feature types.

To determine the chromosomal location and biological role of any gene, we used the gene annotation database provided by Affymetrix. Using this database, we can add columns to the gene expression profile dataset holding the chromosomal location and gene name for every probe. From this, we can isolate genes based on their chromosomal location.

Chromosome 21 genes present in gene expression profile dataset:  609
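The annotation and filtering step described above can be sketched as follows. This is a minimal, dependency-free illustration rather than the code used in the analysis: the helper names (`chromosome_of`, `annotate`, `select_chromosome`), the `demo_probe_*` entries and all expression values are hypothetical. Only the two real probe IDs, their gene symbols and their location strings are taken from the annotation rows shown later in this section.

```python
import re

# Hypothetical miniature inputs; the real tables hold ~50,000 probes.
# annotation: probe set ID -> (gene symbol, chromosomal location string)
annotation = {
    "11715233_s_at": ("SPATC1L", "chr21q22.3"),
    "11723425_at": ("IFNAR1", "chr21q22.11"),
    "demo_probe_1": ("DEMO1", "chr2p13.3"),  # made-up probe on another chromosome
    "demo_probe_2": ("DEMO2", "---"),        # made-up probe with no location
}

# expression: probe set ID -> expression values, one per sample (made-up numbers)
expression = {
    "11715233_s_at": [2.4, 2.6, 2.5],
    "11723425_at": [3.9, 4.1, 4.4],
    "demo_probe_1": [1.8, 2.3, 2.7],
    "demo_probe_2": [3.5, 4.0, 4.2],
}


def chromosome_of(location):
    """Return the chromosome label from a string like 'chr21q22.3', or None."""
    match = re.match(r"chr(\d+|[XY])", location)
    return match.group(1) if match else None


def annotate(expression, annotation):
    """Attach gene symbol and chromosome to every probe's expression values."""
    rows = []
    for probe_id, values in expression.items():
        symbol, location = annotation.get(probe_id, (None, "---"))
        rows.append({
            "probe": probe_id,
            "symbol": symbol,
            "chromosome": chromosome_of(location),
            "values": values,
        })
    return rows


def select_chromosome(rows, chromosome):
    """Keep only the probes whose target gene lies on the given chromosome."""
    return [row for row in rows if row["chromosome"] == chromosome]


chr21_rows = select_chromosome(annotate(expression, annotation), "21")
print("Chromosome 21 probes in the toy dataset:", len(chr21_rows))
```

Parsing the chromosome out of the location string, rather than matching on the prefix `chr2`, avoids confusing chromosome 2 with chromosome 21.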

Using the Chromosome 21 Dataset

| Gene_PSID (rows) / SubjectID (columns) | LocusLink | Symbol | 116_S_1249 | 037_S_4410 | 006_S_4153 | 116_S_1232 | 099_S_4205 | 007_S_4467 | 128_S_0205 | 003_S_2374 | ... | 014_S_4668 | 130_S_0289 | 141_S_4456 | 009_S_2381 | 053_S_4557 | 073_S_4300 | 041_S_4014 | 007_S_0101 | Biological Name | Chromosome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Visit | NaN | NaN | m48 | v03 | v03 | m48 | v03 | v03 | v06 | bl | ... | v03 | m60 | v03 | bl | v03 | v03 | v03 | v06 | NaN | NaN |
| Phase | NaN | NaN | ADNIGO | ADNI2 | ADNI2.1 | ADNIGO.1 | ADNI2.2 | ADNI2.3 | ADNI2.4 | ADNIGO.2 | ... | ADNI2.443 | ADNIGO.293 | ADNI2.444 | ADNIGO.294 | ADNI2.445 | ADNI2.446 | ADNI2.447 | ADNI2.448 | Unnamed: 747 | NaN |
| 260/280 | NaN | NaN | 2.05 | 2.07 | 2.04 | 2.03 | 2.01 | 2.05 | 1.95 | 1.99 | ... | 2.05 | 1.98 | 2.09 | 1.87 | 2.03 | 2.11 | 1.94 | 2.06 | NaN | NaN |
| 260/230 | NaN | NaN | 0.55 | 1.54 | 2.1 | 1.52 | 1.6 | 1.91 | 1.47 | 2.07 | ... | 2.05 | 1.65 | 1.56 | 1.45 | 1.33 | 0.27 | 1.72 | 1.35 | NaN | NaN |
| RIN | NaN | NaN | 7.7 | 7.6 | 7.2 | 6.8 | 7.9 | 7 | 7.9 | 7.2 | ... | 6.5 | 6.3 | 6.4 | 6.6 | 6.8 | 6.2 | 5.8 | 6.7 | NaN | NaN |
| Affy Plate | NaN | NaN | 7 | 3 | 6 | 7 | 9 | 4 | 3 | 8 | ... | 6 | 9 | 3 | 8 | 5 | 3 | 1 | 4 | NaN | NaN |
| YearofCollection | NaN | NaN | 2011 | 2012 | 2011 | 2011 | 2011 | 2012 | 2011 | 2011 | ... | 2012 | 2011 | 2012 | 2011 | 2012 | 2011 | 2011 | 2012 | NaN | NaN |
| ProbeSet | LocusLink | Symbol | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 11715130_s_at | LOC337967 | KRTAP6-2 | 2.424 | 2.623 | 2.501 | 3.103 | 2.567 | 2.992 | 2.249 | 2.63 | ... | 3.074 | 2.763 | 2.89 | 3.155 | 2.526 | 2.452 | 2.822 | 2.651 | [KRTAP6-2] keratin associated protein 6-2 | 21 |
| 11715131_s_at | LOC337975 | KRTAP20-1 | 1.826 | 2.306 | 2.735 | 2.777 | 2.897 | 2.631 | 2.613 | 2.481 | ... | 2.46 | 2.374 | 2.074 | 3.121 | 2.725 | 2.21 | 2.828 | 2.444 | [KRTAP20-1] keratin associated protein 20-1 | 21 |
| 11715144_s_at | LOC337974 | KRTAP19-7 | 3.256 | 3.404 | 4.112 | 3.279 | 3.67 | 3.279 | 3.257 | 3.86 | ... | 3.856 | 3.64 | 3.642 | 3.772 | 4.187 | 3.579 | 4.542 | 3.92 | [KRTAP19-7] keratin associated protein 19-7 | 21 |
| 11715145_s_at | LOC337976 | KRTAP20-2 | 3.529 | 4.029 | 4.254 | 4.094 | 3.629 | 3.632 | 3.614 | 4.021 | ... | 4.026 | 3.808 | 3.966 | 4.134 | 3.881 | 3.833 | 4.112 | 3.945 | [KRTAP20-2] keratin associated protein 20-2 | 21 |
| 11715156_s_at | LOC337966 | KRTAP6-1 | 3.855 | 3.9 | 4.124 | 4.428 | 3.947 | 4.371 | 4.14 | 4.129 | ... | 4.783 | 3.961 | 4.035 | 4.428 | 3.972 | 4.208 | 4.622 | 4.147 | [KRTAP6-1] keratin associated protein 6-1 | 21 |
| 11715157_s_at | LOC337969 | KRTAP19-2 | 2 | 2.162 | 2.135 | 2.144 | 2.144 | 2.147 | 1.938 | 2.27 | ... | 2.19 | 2.045 | 2.545 | 2.222 | 2.332 | 1.998 | 2.133 | 2.238 | [KRTAP19-2] keratin associated protein 19-2 | 21 |
| 11715158_s_at | LOC337971 | KRTAP19-4 | 2.682 | 2.993 | 2.778 | 2.904 | 2.714 | 2.672 | 2.837 | 2.578 | ... | 2.617 | 2.877 | 2.944 | 2.612 | 2.729 | 2.7 | 2.837 | 2.582 | [KRTAP19-4] Keratin associated protein 19-4 | 21 |

15 rows × 748 columns

In this dataset, we find 13 features corresponding to probes of the APP gene, a gene strongly implicated in Alzheimer’s Disease.

As a baseline for analysing the role of chromosome 21 genes in the etiology of Alzheimer’s Disease, we fitted a simple logistic regression to a feature matrix X containing only these 13 features.

To reduce the effect of class imbalance in the data, we applied the Synthetic Minority Over-sampling Technique (SMOTE) from the imblearn package.
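To make the oversampling step concrete, here is a simplified, standard-library sketch of the idea behind SMOTE: each synthetic minority sample is placed at a random point on the line segment between a real minority sample and one of its nearest minority-class neighbours. The function name `smote`, its parameters and the toy data are hypothetical. In imblearn the same technique is available as `SMOTE` in `imblearn.over_sampling`, typically called via `fit_resample(X, y)`; that the original notebook used exactly this call is an assumption.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Candidate neighbours: every other minority sample, nearest first.
        others = [point for point in minority if point is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Made-up two-feature minority class with 4 samples; suppose the majority
# class has 10 samples, so 6 synthetic points would balance the classes.
minority = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [1.2, 1.8]]
new_points = smote(minority, n_new=6, k=2, seed=1)
print(len(minority) + len(new_points), "minority samples after oversampling")
```

Oversampling is normally applied to the training data only, so that synthetic points derived from test samples do not leak into the evaluation. The source does not state how this was handled here.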

The cross-validation and test scores for this baseline model are:

Cross-validation score  0.5798706240487063
Test score  0.5178571428571429

Both scores are only marginally better than random guessing, so the APP gene expression profile appears to be a very poor predictor of AD.

Next, we built a similarly simple model using all genes on chromosome 21 to see whether any stood out as interesting.

The results of this are:

Cross-validation score  0.8115677321156773
Test score  0.6965041721563461

These results are slightly more promising and warrant a closer look at the details of the model: which genes have the coefficients of largest magnitude?

| Field | Biggest positive coefficient | Biggest negative coefficient |
|---|---|---|
| (row index) | 133 | 8325 |
| Probe Set ID | 11715233_s_at | 11723425_at |
| UniGene ID | Hs.381214 | Hs.529400 |
| Alignments | chr21:47581062-47604373 (-) // 95.94 // q22.3 | chr21:34697208-34732236 (+) // 82.3 // q22.11 |
| Gene Title | spermatogenesis and centriole associated 1-like | interferon (alpha, beta and omega) receptor 1 |
| Gene Symbol | SPATC1L | IFNAR1 |
| Chromosomal Location | chr21q22.3 | chr21q22.11 |
| Entrez Gene | 84221 | 3454 |
| SwissProt | Q9H0A9 | P17181 |
| RefSeq Protein ID | NP_001136326 /// NP_115637 /// XP_005261245 //... | NP_000620 /// XP_005261021 /// XP_011527854 |
| RefSeq Transcript ID | NM_001142854 /// NM_032261 /// XM_005261188 //... | NM_000629 /// XM_005260964 /// XM_011529552 |
| Gene Ontology Biological Process | --- | 0007166 // cell surface receptor signaling pat... |
| Gene Ontology Cellular Component | --- | 0005622 // intracellular // traceable author s... |
| Gene Ontology Molecular Function | 0005515 // protein binding // inferred from ph... | 0004904 // interferon receptor activity // inf... |
| Pathway | --- | --- |
| InterPro | IPR029384 // Speriolin, C-terminal // 1.0E-75 ... | IPR003961 // Fibronectin type III // 2.1E-35 /... |
| Trans Membrane | --- | --- |
| Chromosome | 21 | 21 |
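Picking out the probes with the most extreme coefficients can be sketched as below. The probe IDs are taken from the output above, but the coefficient values are made up for illustration and the helper names are hypothetical. If the model was fitted with scikit-learn (an assumption), the coefficient list for a binary logistic regression would come from `model.coef_[0]`.

```python
def extreme_coefficients(probe_ids, coefficients):
    """Return the (probe, coefficient) pairs with the largest positive
    and the largest negative coefficient."""
    pairs = list(zip(probe_ids, coefficients))
    most_positive = max(pairs, key=lambda pair: pair[1])
    most_negative = min(pairs, key=lambda pair: pair[1])
    return most_positive, most_negative


def rank_by_magnitude(probe_ids, coefficients, top=3):
    """Return the `top` probes ordered by absolute coefficient size."""
    pairs = zip(probe_ids, coefficients)
    return sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)[:top]


# Made-up coefficients for three real probe IDs (illustration only).
probe_ids = ["11715233_s_at", "11723425_at", "11715130_s_at"]
coefficients = [0.80, -0.90, 0.05]

most_positive, most_negative = extreme_coefficients(probe_ids, coefficients)
print("Biggest positive coefficient:", most_positive[0])
print("Biggest negative coefficient:", most_negative[0])
```

The probe IDs returned this way can then be looked up in the annotation table to obtain gene titles and ontology terms. Coefficient sizes are only comparable across probes if the features are on a similar scale.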

Here, the gene with the largest positive coefficient in this model (SPATC1L) has no identified link with cognitive function, autophagy, protein synthesis, calcium signalling, the immune system, inflammation or any other process linked with neurodegenerative disease, and is only moderately expressed in the brain. Moreover, the gene with the strongest negative coefficient (IFNAR1) has previously been positively linked with amyloidogenesis in mice, contrary to what our model might imply.

This suggests that there may be limited immediate benefit to using gene expression profile data. To examine this further, we plotted the gene expression distribution for all 13 APP gene probes and for the gene probe found to be the strongest predictor in our model. We see no clear difference in distribution between AD and non-AD, again implying that gene expression data may not be as useful as first thought.

[Figures: gene expression distributions for the 13 APP gene probes and for the strongest-predictor probe, AD versus non-AD]
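As a numeric complement to the plots, the two groups can also be compared with simple summary statistics. The sketch below is not part of the original analysis: it computes the group means and standard deviations for one probe together with a standardised mean difference (Cohen's d with a pooled standard deviation), using made-up expression values and hypothetical function names.

```python
from statistics import mean, stdev


def summarise(values):
    """Basic summary of one group's expression values for a single probe."""
    return {"n": len(values), "mean": mean(values), "sd": stdev(values)}


def standardised_difference(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_variance ** 0.5


# Made-up expression values for one probe, split by diagnosis.
ad_values = [3.9, 4.1, 4.4, 4.0, 4.2]
non_ad_values = [4.0, 4.2, 4.3, 3.9, 4.1]

print("AD:", summarise(ad_values))
print("non-AD:", summarise(non_ad_values))
print("standardised difference:", standardised_difference(ad_values, non_ad_values))
```

A value of d close to zero would match the visual impression of heavily overlapping distributions, although it says nothing about differences in shape that leave the mean unchanged.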

When researching details of the gene expression dataset, we found that the samples from which the data are derived were taken from blood, not the central nervous system. This means that the gene expression profiles are not indicative of the cellular environment in the brain, as gene expression profiles vary hugely between tissues and the central nervous system is separated from the rest of the body by a near-impervious blood-brain barrier. As such, even though other models could be tested and other chromosomes examined, we decided that the chance of finding valuable results was too low and focussed on other aims.

Sources

[1] https://www.nia.nih.gov/health/alzheimers-disease-people-down-syndrome

[2] https://www.alz.org/alzheimers-dementia/what-is-dementia/types-of-dementia/down-syndrome

[3] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4019841/

https://www.ncbi.nlm.nih.gov/pubmed/18199027

http://www.bbc.com/future/story/20181022-there-is-mounting-evidence-that-herpes-leads-to-alzheimers