Provable Tempered Overfitting of Minimal Nets and Typical Nets

By Itamar Harel, William M. Hoza, Gal Vardi, Itay Evron, Nathan Srebro, and Daniel Soudry [non-alphabetical]


Read the paper: arXiv

Abstract (for specialists)

We study the overfitting behavior of fully connected deep Neural Networks (NNs) with binary weights fitted to perfectly classify a noisy training set. We consider interpolation using both the smallest NN (having the minimal number of weights) and a random interpolating NN. For both learning rules, we prove overfitting is tempered. Our analysis rests on a new bound on the size of a threshold circuit consistent with a partial function. To the best of our knowledge, ours are the first theoretical results on benign or tempered overfitting that: (1) apply to deep NNs, and (2) do not require a very high or very low input dimension.
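To unpack the terminology (a rough schematic rather than the paper's precise statement, using notation that is illustrative and not the paper's): for binary classification with label noise, "benign" overfitting means the test error of the learned interpolating network $h$ approaches that of the best possible predictor, "catastrophic" means it degrades to that of a trivial predictor (e.g., random guessing), and "tempered" means it lands strictly in between:

$$ \mathrm{err}_{\text{best}} \;<\; \mathrm{err}(h) \;<\; \mathrm{err}_{\text{trivial}}, $$

glossing over limits, quantifiers, and whether the error is measured against clean or noisy labels. The quantitative bounds proved for the two learning rules are spelled out in the paper.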

Not-so-abstract (for curious outsiders)

⚠️ This summary might gloss over some important details.

According to traditional wisdom in machine learning, it's a bad sign if your hypothesis perfectly fits your training data. It suggests that you're using too many parameters, you're fitting the noise instead of the signal, and your hypothesis won't perform well when it is tested on new data. This phenomenon is called "overfitting."

However, this is not what is observed empirically in modern machine learning. Machine learning practitioners have found that deep neural networks perform well when they are tested on new data, even if they are trained to perfectly fit their training data.

Our paper develops theoretical explanations for this empirical phenomenon. Consider any binary classification problem (e.g., given a photo, determine whether it depicts a cat). Assume that there does in fact exist a neural network $h^{\star}$ that solves this problem; our hypothesis will be a neural network $h$ that is considerably larger than $h^{\star}$. Specifically, given a bunch of noisy labeled examples, suppose we pick the parameters of $h$ uniformly at random; if $h$ does not perfectly fit the given training data, we discard it and pick fresh random parameters, repeating until it does. (We work with a simplified "neural network" model in which the weights are binary and the activation functions are simple thresholding.) Under these idealized assumptions, we prove that overfitting is "tempered," meaning that when the hypothesis $h$ is tested on new data, it will perform better than a completely trivial hypothesis.
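To make this random-interpolator ("guess-and-check") learning rule concrete, here is a minimal Python sketch, not taken from the paper: it samples every weight of a small fully connected threshold network uniformly at random from $\{-1,+1\}$ and resamples until the network perfectly fits the training labels. The function names, the choice of $\{-1,+1\}$ weights, the inclusion of biases, and the tie-breaking at zero are illustrative assumptions rather than the paper's exact model.

import numpy as np

rng = np.random.default_rng(0)

def threshold_net(x, weights, biases):
    # Forward pass of a toy fully connected network with binary ({-1,+1})
    # weights and hard-threshold (sign) activations; returns a label in {-1,+1}.
    h = x
    for W, b in zip(weights, biases):
        h = np.sign(W @ h + b)
        h[h == 0] = 1  # break ties at zero arbitrarily (an assumption)
    return int(h[0])

def random_binary_net(layer_sizes):
    # Sample all weights and biases independently and uniformly from {-1,+1}.
    weights, biases = [], []
    for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights.append(rng.choice([-1, 1], size=(d_out, d_in)))
        biases.append(rng.choice([-1, 1], size=d_out))
    return weights, biases

def guess_and_check(X, y, layer_sizes, max_tries=200_000):
    # The learning rule: keep drawing random binary networks until one
    # perfectly fits (interpolates) the possibly noisy training labels.
    for _ in range(max_tries):
        weights, biases = random_binary_net(layer_sizes)
        preds = np.array([threshold_net(x, weights, biases) for x in X])
        if np.array_equal(preds, y):
            return weights, biases
    raise RuntimeError("no interpolating network found within max_tries")

# Tiny demo: four labeled points in {-1,+1}^2 with XOR-style labels.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
y = np.array([-1, 1, 1, -1])
weights, biases = guess_and_check(X, y, layer_sizes=[2, 3, 1])
print([threshold_net(x, weights, biases) for x in X])

Of course, resampling like this is hopelessly slow in general; the point of this learning rule is to reason about what a typical interpolating network looks like, not to propose a practical training algorithm.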

We posted the paper to arXiv in October 2024; Itamar presented it at NeurIPS in December 2024. An earlier version of the paper also appeared at the 2nd Workshop on High-dimensional Learning Dynamics (HiLD) at ICML 2024.


Expository material:

  • [Video]. This is Itamar's video for NeurIPS (December 2024).
  • [Poster png]. This is Itamar's poster for NeurIPS (December 2024).