Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01rv042w30x
Title: Statistical Inference of Variables Driving Systematic Variation in High-Dimensional Biological Data
Authors: Chung, Neo Christopher Honghoon
Advisors: Storey, John D
Contributors: Quantitative Computational Biology Department
Keywords: data
jackstraw
latent variable model
principal component analysis
resampling
sparse pca
Subjects: Biostatistics
Bioinformatics
Statistics
Issue Date: 2014
Publisher: Princeton, NJ : Princeton University
Abstract: Modern genomic technologies collect an ever-increasing amount of information (e.g., gene expression and genotypes) about model organisms and humans. Systematic patterns of variation in such large-scale biological studies reflect the underlying molecular signatures of disease status, environment, and other factors, and can be quantified using principal component analysis (PCA) and related methods. For example, histological examination of tumor cells has long provided clinical classifications of cancer that are indirect, imprecise, and low-resolution. In contrast, we can infer different types of cancer directly from gene expression profiles of cancerous tumor samples. An unsolved problem in this context is how to systematically identify the observed variables that drive the systematic variation captured by PCA. My dissertation introduces a statistical framework for rigorously using such quantitative characterizations of systematic variation. The key challenge in utilizing latent variable estimates -- such as principal components (PCs) -- is how to prevent overfitting. It is well established that conventional statistical tests for association that use quantities estimated from the data itself artificially inflate statistical significance, because the data is used twice. We introduce a general resampling approach, called the jackstraw, to calculate the statistical significance of association between the observed variables and their latent variables, while automatically adjusting for how much PCA overfits the particular dataset. Furthermore, based on weights derived from the jackstraw, we developed significance-based shrinkage methods for the loadings of PCs and for high-dimensional covariance matrices, called the jackstraw weighted shrinkage. Using this set of proposed methods, we investigated genetic differentiation due to the global human population structure.
Overall, the proposed statistical framework makes minimal assumptions and offers flexibility in exploring and analyzing the data, while providing a safeguard against anti-conservative bias due to overfitting.
URI: http://arks.princeton.edu/ark:/88435/dsp01rv042w30x
Alternate format: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog.
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections: Quantitative Computational Biology

Files in This Item:
File: Chung_princeton_0181D_11068.pdf
Size: 6.24 MB
Format: Adobe PDF


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.