Due to the prevalence of machine learning algorithms and the potential for their decisions to profoundly impact billions of human lives, it is crucial that they are robust, reliable, and understandable. This thesis examines key theoretical pillars of machine learning surrounding generalization and overfitting, and tests the extent to which empirical behavior matches existing theory. We develop novel methods for measuring overfitting and generalization, and we characterize how reproducible observed behavior is across differences in optimization algorithm, dataset, task, evaluation metric, and domain.
First, we examine how optimization algorithms bias machine learning models towards solutions with varying generalization properties. We show that adaptive gradient methods empirically find solutions with inferior generalization behavior compared to those found by stochastic gradient descent. We then construct an example using a simple overparameterized model that corroborates the algorithms’ empirical behavior on neural networks. Next, we study the extent to which machine learning models have overfit to commonly reused datasets in both academic benchmarks and machine learning competitions. We build new test sets for the CIFAR-10 and ImageNet datasets and evaluate a broad range of classification models on the new datasets. All models experience a drop in accuracy, which indicates that current accuracy numbers are susceptible to even minute natural variations in the data distribution. Surprisingly, despite several years of adaptively selecting the models to perform well on these competitive benchmarks, we find no evidence of overfitting. We then analyze data from the machine learning platform Kaggle and find little evidence of substantial overfitting in ML competitions. These findings speak to the robustness of the holdout method across different data domains, loss functions, model classes, and human analysts.
Overall, our work suggests that the true concern for robust machine learning is distribution shift rather than overfitting, and designing models that still work reliably in dynamic environments is a challenging but necessary undertaking.
Measuring Generalization and Overfitting in Machine Learning
Researchers may make free and open use of the UC Berkeley Library’s digitized public domain materials. However, some materials in our online collections may be protected by U.S. copyright law (Title 17, U.S.C.). Use or reproduction of materials protected by copyright beyond that allowed by fair use (Title 17, U.S.C. § 107) requires permission from the copyright owners. The use or reproduction of some materials may also be restricted by terms of University of California gift or purchase agreements, privacy and publicity rights, or trademark law. Responsibility for determining rights status and permissibility of any use or reproduction rests exclusively with the researcher. To learn more or make inquiries, please see our permissions policies (https://www.lib.berkeley.edu/about/permissions-policies).