Other Considerations

Gene Expression Profiles

Consideration of an Alternative Question: Genetic Predictors

Before settling on a predictive model of diagnosis built from several types of patient information, we considered using the detailed biological data available to examine the importance of genes on chromosome 21 in Alzheimer’s Disease (AD). Chromosome 21 is known to be very important in the etiology of AD: for example, over half of those born with an extra copy of Chr21 (a condition known as Down’s Syndrome) will go on to develop Alzheimer’s Disease [1]. While the link between Chr21 and AD is commonly attributed to the gene for the Amyloid Precursor Protein (APP) [2], many other genes on this chromosome have also been linked to Alzheimer’s Disease [3].

Of the wealth of genetic data that ADNI publishes, we drew primarily on the microarray gene expression profile dataset, which used ~50,000 genetic probes to assess the activity of genes across the genome. The outcome for each patient in the gene expression profile dataset was determined using the ADNIMerge dataset. To identify only chromosome 21 genes, we used the Affymetrix gene annotation dataset to annotate the gene expression dataset with the chromosomal location of the target gene for every probe. Combining these three datasets makes it possible to build a model based on the genes of any chromosome to predict any clinical outcome. The preliminary models (discussed below) showed little promise when gene expression data was considered in isolation, so we ultimately decided to focus on a more general predictive model involving more feature types instead.

To determine the chromosomal location and biological role of each gene, we used the gene annotation database provided by Affymetrix. Using this database, we can add columns to the gene expression profile dataset for the chromosomal location and gene name of every probe's target gene. From this, we can isolate genes based on their chromosomal location.
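
The annotation step described above can be sketched roughly as follows. This is a minimal illustration on toy data, not the code used for the project: the function name `add_chromosome`, the `hypothetical_probe_at` row, and all expression values are assumptions. The annotation column names (`Probe Set ID`, `Gene Symbol`, `Chromosomal Location`) and the two chromosome 21 entries follow the annotation output shown later in this section.

```python
import pandas as pd

# Toy stand-in for the Affymetrix annotation table.
annotation = pd.DataFrame({
    "Probe Set ID": ["11715233_s_at", "11723425_at", "hypothetical_probe_at"],
    "Gene Symbol": ["SPATC1L", "IFNAR1", "GENE_X"],  # GENE_X is made up
    "Chromosomal Location": ["chr21q22.3", "chr21q22.11", "chr1p36.33"],
})

# Toy expression table: one row per probe, one column per subject.
# The expression values are illustrative only.
expression = pd.DataFrame(
    {"116_S_1249": [2.4, 3.1, 5.0], "037_S_4410": [2.6, 3.0, 4.8]},
    index=pd.Index(
        ["11715233_s_at", "11723425_at", "hypothetical_probe_at"],
        name="ProbeSet",
    ),
)


def add_chromosome(expression, annotation):
    """Attach gene symbol and chromosome columns to the expression table.

    Assumes each probe ID appears at most once in the annotation table.
    """
    ann = annotation.set_index("Probe Set ID")
    # Pull the chromosome out of strings such as "chr21q22.3".
    chrom = ann["Chromosomal Location"].str.extract(r"^chr(\d+|X|Y)", expand=False)
    out = expression.copy()
    out["Symbol"] = ann["Gene Symbol"].reindex(out.index)
    out["Chromosome"] = chrom.reindex(out.index)
    return out


annotated = add_chromosome(expression, annotation)
chr21 = annotated[annotated["Chromosome"] == "21"]
print("Chromosome 21 probes in toy data:", len(chr21))
```

Keeping the chromosome as a string avoids special-casing the X and Y chromosomes, at the cost of comparing against `"21"` rather than `21`.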

Chromosome 21 genes present in gene expression profile dataset:  609

Using the Chromosome 21 Dataset

SubjectID	LocusLink	Symbol	116_S_1249	037_S_4410	006_S_4153	116_S_1232	099_S_4205	007_S_4467	128_S_0205	003_S_2374	...	014_S_4668	130_S_0289	141_S_4456	009_S_2381	053_S_4557	073_S_4300	041_S_4014	007_S_0101	Biological Name	Chromosome
Gene_PSID
Visit	NaN	NaN	m48	v03	v03	m48	v03	v03	v06	bl	...	v03	m60	v03	bl	v03	v03	v03	v06	NaN	NaN
Phase	NaN	NaN	ADNIGO	ADNI2	ADNI2.1	ADNIGO.1	ADNI2.2	ADNI2.3	ADNI2.4	ADNIGO.2	...	ADNI2.443	ADNIGO.293	ADNI2.444	ADNIGO.294	ADNI2.445	ADNI2.446	ADNI2.447	ADNI2.448	Unnamed: 747	NaN
260/280	NaN	NaN	2.05	2.07	2.04	2.03	2.01	2.05	1.95	1.99	...	2.05	1.98	2.09	1.87	2.03	2.11	1.94	2.06	NaN	NaN
260/230	NaN	NaN	0.55	1.54	2.1	1.52	1.6	1.91	1.47	2.07	...	2.05	1.65	1.56	1.45	1.33	0.27	1.72	1.35	NaN	NaN
RIN	NaN	NaN	7.7	7.6	7.2	6.8	7.9	7	7.9	7.2	...	6.5	6.3	6.4	6.6	6.8	6.2	5.8	6.7	NaN	NaN
Affy Plate	NaN	NaN	7	3	6	7	9	4	3	8	...	6	9	3	8	5	3	1	4	NaN	NaN
YearofCollection	NaN	NaN	2011	2012	2011	2011	2011	2012	2011	2011	...	2012	2011	2012	2011	2012	2011	2011	2012	NaN	NaN
ProbeSet	LocusLink	Symbol	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
11715130_s_at	LOC337967	KRTAP6-2	2.424	2.623	2.501	3.103	2.567	2.992	2.249	2.63	...	3.074	2.763	2.89	3.155	2.526	2.452	2.822	2.651	[KRTAP6-2] keratin associated protein 6-2	21
11715131_s_at	LOC337975	KRTAP20-1	1.826	2.306	2.735	2.777	2.897	2.631	2.613	2.481	...	2.46	2.374	2.074	3.121	2.725	2.21	2.828	2.444	[KRTAP20-1] keratin associated protein 20-1	21
11715144_s_at	LOC337974	KRTAP19-7	3.256	3.404	4.112	3.279	3.67	3.279	3.257	3.86	...	3.856	3.64	3.642	3.772	4.187	3.579	4.542	3.92	[KRTAP19-7] keratin associated protein 19-7	21
11715145_s_at	LOC337976	KRTAP20-2	3.529	4.029	4.254	4.094	3.629	3.632	3.614	4.021	...	4.026	3.808	3.966	4.134	3.881	3.833	4.112	3.945	[KRTAP20-2] keratin associated protein 20-2	21
11715156_s_at	LOC337966	KRTAP6-1	3.855	3.9	4.124	4.428	3.947	4.371	4.14	4.129	...	4.783	3.961	4.035	4.428	3.972	4.208	4.622	4.147	[KRTAP6-1] keratin associated protein 6-1	21
11715157_s_at	LOC337969	KRTAP19-2	2	2.162	2.135	2.144	2.144	2.147	1.938	2.27	...	2.19	2.045	2.545	2.222	2.332	1.998	2.133	2.238	[KRTAP19-2] keratin associated protein 19-2	21
11715158_s_at	LOC337971	KRTAP19-4	2.682	2.993	2.778	2.904	2.714	2.672	2.837	2.578	...	2.617	2.877	2.944	2.612	2.729	2.7	2.837	2.582	[KRTAP19-4] Keratin associated protein 19-4	21

15 rows × 748 columns

In this dataset, we find 13 features corresponding to probes of the APP gene, a major gene implicated in Alzheimer’s Disease.

To analyse the role of different genes on chromosome 21 in the etiology of Alzheimer’s Disease, we first needed a simple baseline, so we fitted a logistic regression to an X dataset consisting of these 13 APP probe features.
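
A baseline of this kind might look like the sketch below. It runs on randomly generated stand-in data, so the scores it prints carry no meaning. The sample size, split ratio, fold count and `max_iter` setting are all assumptions rather than the settings used for the results reported here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in: 200 subjects x 13 probe features, random binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))
y = rng.integers(0, 2, size=200)

# Hold out a test set, keeping the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)

# Mean accuracy over 5 cross-validation folds of the training data.
cv_score = cross_val_score(model, X_train, y_train, cv=5).mean()

# Refit on the full training set and score once on the held-out test set.
test_score = model.fit(X_train, y_train).score(X_test, y_test)

print("Cross-validation score ", cv_score)
print("Test score ", test_score)
```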

To guard against class imbalance in the data, we applied the Synthetic Minority Over-sampling Technique (SMOTE) from the imblearn package.
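
To make the idea behind SMOTE concrete, here is a hand-rolled NumPy sketch of its core interpolation step. It is not imblearn's implementation, which is exposed as the `SMOTE` class in `imblearn.over_sampling` with a `fit_resample` method. The function name `smote_sketch` and the toy class sizes are assumptions for illustration only.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority-class
    neighbours. Needs at least two minority samples."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)  # which sample to start from
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the connecting segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy imbalanced data: 40 majority samples vs 10 minority samples.
rng = np.random.default_rng(1)
X_majority = rng.normal(0.0, 1.0, size=(40, 13))
X_minority = rng.normal(1.0, 1.0, size=(10, 13))

synthetic = smote_sketch(X_minority, n_new=len(X_majority) - len(X_minority))
X_balanced = np.vstack([X_majority, X_minority, synthetic])
y_balanced = np.array([0] * len(X_majority) + [1] * (len(X_minority) + len(synthetic)))
print(X_balanced.shape, np.bincount(y_balanced))
```

One design point worth keeping in mind: synthetic samples are usually generated from the training portion only, so that no interpolated copies of test or validation subjects leak into model fitting.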

The cross-validation and test scores for this baseline model are:

Cross-validation score  0.5798706240487063
Test score  0.5178571428571429

Both are only marginally better than random. On its own, the APP gene expression profile appears to be a very poor predictor of AD.

Next, we built a similarly simple model using all genes on chromosome 21 to see whether anything in particular stood out as interesting.

The results for this model are:

Cross-validation score  0.8115677321156773
Test score  0.6965041721563461

These results are slightly more promising, although the gap between the cross-validation and test scores suggests some overfitting. They warrant a closer look at the details of the model: which genes have the largest-magnitude coefficients?
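
Reading off the extreme coefficients of a fitted logistic regression can be sketched as below. The data is synthetic with a planted signal so that the ranking can be checked. The probe names are hypothetical and this is not the project's model. Coefficient magnitudes are only comparable across features when the features are on similar scales, which is assumed here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
probe_ids = [f"probe_{i}" for i in range(20)]  # hypothetical probe names
X = rng.normal(size=(300, 20))

# Plant a known signal: probe_3 raises the log-odds, probe_7 lowers them.
logits = 2.0 * X[:, 3] - 2.0 * X[:, 7]
y = (logits + rng.normal(scale=0.5, size=300) > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# For a binary problem, coef_ has shape (1, n_features).
coefs = model.coef_.ravel()
most_positive = probe_ids[int(np.argmax(coefs))]
most_negative = probe_ids[int(np.argmin(coefs))]

print("Biggest positive coefficient:", most_positive)
print("Biggest negative coefficient:", most_negative)
```

The resulting probe IDs can then be looked up in the annotation table to retrieve the gene title and symbol.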

Biggest positive coefficient:

	Probe Set ID	UniGene ID	Alignments	Gene Title	Gene Symbol	Chromosomal Location	Entrez Gene	SwissProt	RefSeq Protein ID	RefSeq Transcript ID	Gene Ontology Biological Process	Gene Ontology Cellular Component	Gene Ontology Molecular Function	Pathway	InterPro	Trans Membrane	Chromosome
133	11715233_s_at	Hs.381214	chr21:47581062-47604373 (-) // 95.94 // q22.3	spermatogenesis and centriole associated 1-like	SPATC1L	chr21q22.3	84221	Q9H0A9	NP_001136326 /// NP_115637 /// XP_005261245 //...	NM_001142854 /// NM_032261 /// XM_005261188 //...	---	---	0005515 // protein binding // inferred from ph...	---	IPR029384 // Speriolin, C-terminal // 1.0E-75 ...	---	21

Biggest negative coefficient:

	Probe Set ID	UniGene ID	Alignments	Gene Title	Gene Symbol	Chromosomal Location	Entrez Gene	SwissProt	RefSeq Protein ID	RefSeq Transcript ID	Gene Ontology Biological Process	Gene Ontology Cellular Component	Gene Ontology Molecular Function	Pathway	InterPro	Trans Membrane	Chromosome
8325	11723425_at	Hs.529400	chr21:34697208-34732236 (+) // 82.3 // q22.11	interferon (alpha, beta and omega) receptor 1	IFNAR1	chr21q22.11	3454	P17181	NP_000620 /// XP_005261021 /// XP_011527854	NM_000629 /// XM_005260964 /// XM_011529552	0007166 // cell surface receptor signaling pat...	0005622 // intracellular // traceable author s...	0004904 // interferon receptor activity // inf...	---	IPR003961 // Fibronectin type III // 2.1E-35 /...	---	21

Here, the gene with the largest positive coefficient in this model (SPATC1L) has no identified link with cognitive function, autophagy, protein synthesis, calcium signalling, the immune system, inflammation or any other process linked with neurodegenerative disease, and it is only moderately expressed in the brain. Moreover, the gene with the strongest negative coefficient (IFNAR1) has previously been positively linked with amyloidogenesis in mice, contrary to what our model might imply.

This suggests that there may be limited immediate benefit to using gene expression profile data. To examine this further, we plotted the gene expression distributions for all 13 APP gene probes and for the probe found to be the strongest predictor in our model. We see no clear difference in distribution between AD and non-AD subjects, which again implies that gene expression data may not be as useful as first thought.

[Figure: gene expression distribution plot, AD vs. non-AD (image not available)]

[Figure: gene expression distribution plot, AD vs. non-AD (image not available)]

While researching the details of the gene expression dataset, we found that the samples from which the data are derived were taken from blood, not the central nervous system. This means that the gene expression profiles are not indicative of the cellular environment in the brain, as gene expression profiles vary hugely between tissues and the central nervous system is separated from the rest of the body by a near-impervious blood-brain barrier. As such, even though other models could be tested and different chromosomes examined, we decided that the chance of finding valuable results was too low and focussed on other aims.

Sources

[1] https://www.nia.nih.gov/health/alzheimers-disease-people-down-syndrome

[2] https://www.alz.org/alzheimers-dementia/what-is-dementia/types-of-dementia/down-syndrome

[3] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4019841/

https://www.ncbi.nlm.nih.gov/pubmed/18199027

http://www.bbc.com/future/story/20181022-there-is-mounting-evidence-that-herpes-leads-to-alzheimers