High-dimensional methods to model biological signal in genome-wide studies

Bass, Andrew Jay

Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01m326m485q

Title:	High-dimensional methods to model biological signal in genome-wide studies
Authors:	Bass, Andrew Jay
Advisors:	Storey, John D
Contributors:	Quantitative Computational Biology Department
Keywords:	False discovery rates; Latent variable models; Optimal discovery procedure; Population structure; Statistical inference
Subjects:	Biostatistics
Issue Date:	2021
Publisher:	Princeton, NJ : Princeton University
Abstract:	Recent advancements in sequencing technology have substantially increased the quality and quantity of data in genomics, presenting novel analytical challenges for biological discovery. In particular, foundational ideas developed in statistics over the past century are not easily extended to these high-dimensional datasets. Therefore, creating novel methodologies to analyze this data is a key challenge faced in statistics, and more generally, biology and computational science. Here I focus on building statistical methods for genome-wide analysis that are statistically rigorous, computationally fast, and easy to implement. In particular, I develop four methods that improve statistical inference of high-dimensional biological data. The first focuses on differential expression analysis where I extend the optimal discovery procedure (ODP) to complex study designs and RNA-seq studies. I find that the extended ODP leverages shared biological signal to substantially improve the statistical power compared to other commonly used testing procedures. The second aims to model the functional relationship between sequencing depth and statistical power in RNA-seq differential expression studies. The resulting model, superSeq, accurately predicts the improvement in statistical power when sequencing additional reads in a completed study. Thus superSeq can guide researchers in choosing a sufficient sequencing depth to maximize statistical power while avoiding unnecessary sequencing costs. The third method estimates the posterior distribution of false discovery rate (FDR) quantities, such as local FDRs and q-values, using a Bayesian nonparametric approach. Specifically, I implement an approximation to these posterior distributions that is scalable to genome-wide datasets using variational inference. These estimated posterior distributions are informative in a significance analysis as they capture the uncertainty of FDR quantities in reported results. 
Finally, I develop a likelihood-based approach to estimating unobserved population structure on the canonical parameter scale. I demonstrate that this framework can flexibly capture arbitrary structure and provide accurate allele frequency estimates while being computationally fast for large population genetic studies. Therefore, this framework is useful for many applications in population genetics, such as accounting for structure in the genome-wide association testing procedure GCATest. Collectively, these four methods address problems typically encountered in a biological analysis and can thus help improve downstream inferences in high-dimensional settings.
URI:	http://arks.princeton.edu/ark:/88435/dsp01m326m485q
Alternate format:	The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: catalog.princeton.edu
Type of Material:	Academic dissertations (Ph.D.)
Language:	en
Appears in Collections:	Quantitative Computational Biology

Files in This Item:

File	Description	Size	Format
Bass_princeton_0181D_13886.pdf		11.37 MB	Adobe PDF
