Description
We study phenomena that arise in classification with linear and lifted models in overparameterized settings, presenting new perspectives on the work of Muthukumar et al. [19, 18]. To simulate real-world setups in which only some features are actually useful, we consider the simplified 1-sparse model, where a single feature carries the signal. We review a sharp characterization of the generalization of min-ℓ2-norm interpolation on Gaussian data [18].
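To make the setup concrete, the following is a minimal sketch (not code from [18]; the dimensions and noise level are illustrative choices) of min-ℓ2-norm interpolation of labels generated by a 1-sparse model with i.i.d. Gaussian features:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 2000                       # overparameterized: d >> n
X = rng.standard_normal((n, d))       # i.i.d. Gaussian features
w_star = np.zeros(d)
w_star[0] = 1.0                       # 1-sparse ground truth: one useful feature
y = X @ w_star + 0.1 * rng.standard_normal(n)

# Min-l2-norm interpolator: w = X^T (X X^T)^{-1} y, via the pseudo-inverse.
w_hat = np.linalg.pinv(X) @ y
assert np.allclose(X @ w_hat, y)      # it interpolates the training labels exactly

print("weight on the true feature:", w_hat[0])
print("weight bled into the useless features:", np.linalg.norm(w_hat[1:]))
```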
In the hard-margin support vector machine (HM-SVM) problem, we show for the stylized ridge featurization that, given a sufficient degree of “effective overparameterization”, all training points become support vectors. We remark that this simple featurization captures the essential behavior of other featurizations as well. A consequence of this theorem is that the solution to the HM-SVM problem is indistinguishable from the min-ℓ2 binary interpolator. Combined with prior work showing that gradient descent initialized at 0 on the squared loss converges to the min-ℓ2 binary interpolator, and that gradient descent on the logistic loss converges to the HM-SVM solution, our work conveys that the choice of loss function has little effect on the learned parameters in overparameterized settings.
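The indistinguishability can be checked numerically; below is our own illustrative sketch (not the paper's experiments), in which scikit-learn's SVC with a very large C stands in for the hard-margin SVM and the problem sizes are arbitrary:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n, d = 30, 3000                        # heavily overparameterized
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0])                   # +/-1 labels from a 1-sparse model

svm = SVC(kernel="linear", C=1e10)     # very large C approximates the hard margin
svm.fit(X, y)
print("support vectors:", len(svm.support_), "out of", n)  # expect all n

w_svm = svm.coef_.ravel()
w_interp = np.linalg.pinv(X) @ y       # min-l2 binary interpolator
cos = w_svm @ w_interp / (np.linalg.norm(w_svm) * np.linalg.norm(w_interp))
print("cosine similarity of the two solutions:", cos)      # expect close to 1
```

When every training point is a support vector, every margin constraint is active, i.e. x_i^T w = y_i for all i, so the HM-SVM solution is exactly the minimum-norm interpolator of the ±1 labels.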
In the regimes where the above theorem holds, we examine whether margin-based explanations for generalization can account for the behavior we observe in our problem. For linear models, we observe that a) the resulting bounds on the probability of misclassifying a test point exceed 1 and are hence tautological, and b) a model with a larger margin often does not generalize better. For kernel-inspired models, we investigate what the right normalization for margins is, and design two new notions of margin appropriate for this setting.
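A back-of-the-envelope sketch of point a), under the assumption that the bound takes the classical form sqrt(R²‖w‖²/n), with R bounding the feature norms and 1/‖w‖ the geometric margin (the exact bound analyzed in our work may differ, and the constants and log factors we drop only make it larger):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 30, 3000
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0])

w = np.linalg.pinv(X) @ y                  # min-l2 interpolator (= HM-SVM solution here)
R = np.linalg.norm(X, axis=1).max()        # radius enclosing the training data
gamma = 1.0 / np.linalg.norm(w)            # geometric margin of the interpolator
bound = np.sqrt(R**2 / (gamma**2 * n))     # classical margin-bound quantity

print("geometric margin:", gamma)
print("margin-bound value:", bound)        # order 1: vacuous as a probability bound
```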
As a preliminary exposition of ongoing work, we examine the ramifications of our results for adversarial performance under the Fourier featurization. Through visualizations, we discover that the learned function exhibits Gibbs-like behavior around jump discontinuities, which causes adversarial examples to proliferate in the vicinity of training points.
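The qualitative picture can be reproduced with the following sketch (our illustration; the number of training points and frequencies are arbitrary choices), which fits a min-ℓ2 interpolator over Fourier features to labels from a sign function:

```python
import numpy as np
import matplotlib.pyplot as plt

def fourier_features(x, k):
    """Map x in [0, 1) to [1, cos(2*pi*j*x), sin(2*pi*j*x)] for j = 1..k."""
    j = np.arange(1, k + 1)
    return np.hstack([np.ones((len(x), 1)),
                      np.cos(2 * np.pi * np.outer(x, j)),
                      np.sin(2 * np.pi * np.outer(x, j))])

n, k = 16, 100                           # 2k+1 = 201 features >> n = 16 points
x_train = (np.arange(n) + 0.5) / n       # evenly spaced training inputs
y_train = np.sign(x_train - 0.5)         # labels with a jump discontinuity at 0.5

Phi = fourier_features(x_train, k)
w = np.linalg.pinv(Phi) @ y_train        # min-l2 interpolator in feature space

x_grid = np.linspace(0, 1, 2000, endpoint=False)
f_grid = fourier_features(x_grid, k) @ w

plt.plot(x_grid, f_grid, label="learned function")
plt.scatter(x_train, y_train, color="k", zorder=3, label="training points")
plt.axhline(0, lw=0.5, color="gray")
plt.legend()
plt.show()   # the learned function oscillates and crosses 0 near training points
```

In the plot, the learned function rings around the jump and between training points, so a small perturbation of an input near a training point can already flip the predicted sign.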