Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01qb98mj69x
Title: Bridging Theory and Practice in Deep Learning: Optimization and Generalization
Authors: Li, Zhiyuan
Advisors: Arora, Sanjeev
Contributors: Computer Science Department
Keywords: Deep Learning
Generalization
Implicit Bias
Learning Theory
Machine Learning
Optimization
Subjects: Computer science
Issue Date: 2022
Publisher: Princeton, NJ : Princeton University
Abstract: Deep learning has been hugely successful for several important applications in the past decade, yet mathematical understanding has lagged behind its breathtaking empirical success. Classic machine learning theory is insufficient to explain various new phenomena in deep learning and to provide guidance on algorithmic choices, largely due to an oversimplified black-box view that ignores the interaction between the model and the optimization algorithm. This dissertation presents a collection of theoretical results that take the interplay between the model and the optimization algorithm into account and aims to bridge the gaps between theory and practice in deep learning for both optimization and generalization. For optimization, we first illustrate the mismatch between traditional optimization theory and deep networks with normalization layers by presenting an exponentially increasing learning rate schedule that works well empirically. We explain this surprise by establishing its equivalence to SGD with weight decay and proving that their convergence rates are fast and insensitive to the initialization scale. Based on this, we design a variant of BERT named SIBERT, which is trainable by SGD and therefore requires less optimizer memory than training with adaptive algorithms like Adam. Finally, we present the first provable yet general setting in which gradient descent decreases the loss in a non-monotone way, as observed empirically. For generalization, we study the implicit bias of optimization algorithms, the phenomenon that the algorithm returns solutions that generalize well despite the existence of poorly generalizing solutions arising from overparametrization. We first give a rigorous justification of why convolutional networks are more sample-efficient than fully-connected networks. Then we provide theoretical justification for the empirical observation that deep linear networks, including matrix factorization, trained by gradient descent from small initialization are implicitly biased toward low-rank solutions. We also identify a condition under which gradient descent with reparametrization is equivalent to mirror descent; this equivalence can be used to understand the implicit bias of non-linear models and recovers several previous results. We further show that gradient descent has an implicit bias toward `flatter' solutions in the presence of certain gradient noise or when its learning rate is larger than two over the sharpness of the loss.
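The abstract's "two over the sharpness of the loss" threshold can be illustrated with a minimal, self-contained sketch (not taken from the dissertation, which treats deep networks): for a 1-D quadratic loss L(x) = (lam/2) x^2, whose sharpness (largest Hessian eigenvalue) is lam, classical analysis says gradient descent is stable exactly when the learning rate is below 2/lam. The loss values lam, eta, and step counts below are arbitrary choices for the demonstration.

```python
# Illustrative sketch, assuming the toy loss L(x) = (lam / 2) * x**2,
# whose sharpness (the only Hessian eigenvalue) is lam. Gradient descent
# contracts when eta < 2 / lam and the iterates blow up once eta exceeds
# that threshold; this is the classical version of the "two over sharpness"
# stability condition referenced in the abstract.

def gd_on_quadratic(lam: float, eta: float, x0: float = 1.0, steps: int = 20):
    """Run gradient descent on L(x) = lam/2 * x^2 and return all iterates."""
    xs = [x0]
    x = x0
    for _ in range(steps):
        x = x - eta * lam * x          # GD update: x <- x - eta * L'(x)
        xs.append(x)
    return xs

if __name__ == "__main__":
    lam = 4.0                          # sharpness of the quadratic
    for eta in (0.40, 0.49, 0.51):     # threshold is 2 / lam = 0.5
        xs = gd_on_quadratic(lam, eta)
        print(f"eta = {eta:.2f} (2/sharpness = {2 / lam:.2f}): "
              f"|x_20| = {abs(xs[-1]):.3e}")
```

Running this prints a shrinking |x_20| for the two learning rates below 0.5 and a growing one for eta = 0.51. The dissertation's result concerns the richer, non-quadratic setting where loss can still decrease non-monotonically near this regime; the quadratic here only pins down where the threshold comes from.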
URI: http://arks.princeton.edu/ark:/88435/dsp01qb98mj69x
Alternate format: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: catalog.princeton.edu
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections: Computer Science

Files in This Item:
File: Li_princeton_0181D_14299.pdf
Size: 8.63 MB
Format: Adobe PDF

