CS 5764: Information Visualization

Homework #4: Data Analysis

The goal of this homework is to analyze a large dataset and draw a conclusion from it.

Instructions:

Examine the following bioinformatics dataset: Lupus.xls. A detailed description of this dataset is included below.
Use whatever tools and methods you want, e.g., Excel, TableLens/Eureka, Spotfire, parallel coordinates, SAS JMP, your favorite tool, your own hacking, etc. Some of these are available in McBryde 104c.
Your goal is to identify THE KEY FINDING in this data. Pretend that you are an experimental biologist trying to understand the key difference between Lupus patients and healthy patients, and perhaps to develop a method for identifying Lupus patients based on their gene expression data.

Hand in: (in class)

A hard copy of a 1-2 page report that lists:

The single most-critical major finding in the data.

A few additional insights.

Based on this, how would you recommend screening patients to determine whether they likely have Lupus?

How accurate do you think your screening method would be, based on this sample data?

What visual representation(s) best enabled analysis of this data, and why?

Include a picture or screenshot or other evidence to support your claim(s).

Make it easy to parse. I like bullets.
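
To make the screening-accuracy question concrete, here is a minimal Python sketch of how a single-gene threshold screen could be scored. It is only an illustration, not a required method: the function name evaluate_threshold and the cutoff of 10000 are hypothetical, and the five values are just the example row shown in the Dataset Description (3 controls, 2 SLE), far too few to support a real conclusion.

```python
def evaluate_threshold(values, is_sle, threshold):
    """Score the rule 'call a patient SLE if expression > threshold'.

    values: expression values, one per patient
    is_sle: True for SLE patients, False for controls (same order)
    """
    tp = fp = tn = fn = 0
    for value, truth in zip(values, is_sle):
        predicted = value > threshold
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif not predicted and truth:
            fn += 1
        else:
            tn += 1
    return {
        # Fraction of SLE patients the screen catches.
        "sensitivity": tp / (tp + fn),
        # Fraction of controls the screen correctly clears.
        "specificity": tn / (tn + fp),
        # Fraction of all patients classified correctly.
        "accuracy": (tp + tn) / len(values),
    }

# Example row M13755: Ctrl 1-3 followed by SLE 1-2.
values = [4173.1, 4421.1, 3885, 3325.1, 18664]
is_sle = [False, False, False, True, True]

# Hypothetical cutoff chosen only for illustration.
scores = evaluate_threshold(values, is_sle, threshold=10000.0)
print(scores)
```

Reporting sensitivity and specificity separately, rather than accuracy alone, shows whether a screen fails by missing patients or by flagging healthy people.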

Dataset Description:

A Single Channel Microarray Experiment Comparing Host Gene Expression in Patients with an Autoimmune Disease and Healthy Controls

ASSUMPTION: Quality assurance and preprocessing have already been performed on these data, and they are reliable. All values passed statistical significance tests, and now you want to examine the data for biological content and learn something about the pathophysiology of systemic lupus erythematosus (SLE), the autoimmune disease under study.

THE EXPERIMENT: You are using published data from Timothy Behrens's lab. SLE, commonly referred to as “lupus”, is a chronic, inflammatory autoimmune disease in which the body produces antibodies against a broad range of its own proteins. Organs commonly targeted include skin, kidney, joints, lung, and the central nervous system. The severity of the disease, spectrum of symptoms, and response to treatment vary tremendously between patients, leading to significant challenges in diagnosis and management of the illness. In this study, after blood draw, peripheral blood mononuclear cells (PBMCs), comprising monocytes/macrophages, B and T lymphocytes, and NK cells, were isolated from control and SLE samples. mRNA was harvested for expression profiling using Affymetrix technology. The column values represent gene expression values (average difference, or AD) for each gene. Scaling was performed to allow comparison between chips. A subset of genes has been selected for your analysis. (Gene expression is essentially a measure of how active a gene is.)

Analyze the data set provided to gain as much understanding as possible about how the gene expression is different in the SLE group and the control group. Before starting your data analysis, think about the kinds of questions you want to ask and the kinds of information that you will consider important in making your evaluation.

The data spreadsheet will look like the one below. The actual data set has 42 Controls and 48 SLE individuals, and approximately 170 genes.

Accession no.	Gene	Ctrl 1	Ctrl 2	Ctrl 3	SLE 1	SLE 2
M13755	Interferon-stimulated protein, 15 kDa	4173.1	4421.1	3885	3325.1	18664
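
As a starting point for exploring a table in this layout, the following Python sketch reads tab-separated text, splits the columns into control and SLE groups by their header prefix, and ranks genes by log2 fold change. It is one possible first pass, not a prescribed analysis. The function name group_summary is hypothetical, the input is only the sample row above, and the sketch assumes the spreadsheet has been exported as tab-separated text with "Ctrl" and "SLE" column prefixes and that group means are positive.

```python
import csv
import io
import math
from statistics import mean

# Header and sample row copied from the layout above (tab-separated).
SAMPLE = (
    "Accession no.\tGene\tCtrl 1\tCtrl 2\tCtrl 3\tSLE 1\tSLE 2\n"
    "M13755\tInterferon-stimulated protein, 15 kDa"
    "\t4173.1\t4421.1\t3885\t3325.1\t18664\n"
)

def group_summary(text):
    """Return per-gene group means and log2(SLE mean / control mean)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    results = []
    for row in reader:
        # Columns are assigned to a group by header prefix (an assumption).
        ctrl = [float(v) for k, v in row.items() if k.startswith("Ctrl")]
        sle = [float(v) for k, v in row.items() if k.startswith("SLE")]
        ctrl_mean = mean(ctrl)
        sle_mean = mean(sle)
        results.append({
            "accession": row["Accession no."],
            "gene": row["Gene"],
            "ctrl_mean": ctrl_mean,
            "sle_mean": sle_mean,
            # log2 is only defined here if both means are positive.
            "log2_fold_change": math.log2(sle_mean / ctrl_mean),
        })
    # Largest absolute change first, so candidate genes surface at the top.
    results.sort(key=lambda r: abs(r["log2_fold_change"]), reverse=True)
    return results

summary = group_summary(SAMPLE)
for r in summary:
    print(r["accession"], r["gene"],
          round(r["ctrl_mean"], 1), round(r["sle_mean"], 1),
          round(r["log2_fold_change"], 2))
```

Group means hide the patient-to-patient spread that is visible even in the sample row (SLE 1 is below every control while SLE 2 is far above), so a ranking like this is best paired with a per-patient view of the top genes.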