Why is SAM Robust to Label Noise?

Abstract

Sharpness-Aware Minimization (SAM) is best known for achieving state-of-the-art performance on natural image and language tasks. However, its most pronounced improvements (of tens of percent) are instead in the presence of label noise. Understanding SAM's label noise robustness requires a departure from characterizing the robustness of minima lying in "flatter" regions of the loss landscape. In particular, the peak performance under label noise occurs with early stopping, far before the loss converges. We decompose SAM's robustness into two effects: one induced by changes to the logit term and the other induced by changes to the network Jacobian. The first can be observed in linear logistic regression, where SAM provably up-weights the gradient contribution from clean examples. Although this explicit up-weighting is also observable in neural networks, when we intervene and modify SAM to remove this effect, surprisingly, we see no visible degradation in performance. We infer that SAM's effect in deeper networks is instead explained entirely by the effect SAM has on the network Jacobian. We theoretically derive the implicit regularization induced by this Jacobian effect in two-layer linear networks. Motivated by our analysis, we see that cheaper alternatives to SAM that explicitly induce these regularization effects largely recover the benefits in deep networks trained on real-world datasets.
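
The logit-term effect in linear logistic regression can be illustrated with a small numerical sketch. It is not the paper's derivation and rests on assumptions not stated in the abstract: SAM's perturbation is applied per example, the loss is log(1 + exp(-m)) with margin m = y * <w, x>, and a higher margin stands in for a "clean" (already well-fit) example. The function names and the values of rho and x_norm are hypothetical. Under these assumptions, the ascent step lowers each example's margin by rho * ||x||, so the gradient coefficient changes from sigmoid(-m) to sigmoid(-m + rho * ||x||).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sgd_weight(margin):
    # Magnitude of the logistic-loss gradient coefficient for an example
    # with margin m = y * <w, x>: the gradient is -sigmoid(-m) * y * x.
    return sigmoid(-margin)


def sam_weight(margin, x_norm, rho):
    # Assumed per-example SAM: w is perturbed by rho * grad / ||grad||,
    # which equals -rho * y * x / ||x|| and lowers the margin by rho * ||x||.
    # The gradient coefficient is then evaluated at the perturbed margin.
    return sigmoid(-(margin - rho * x_norm))


def upweight_ratio(margin, x_norm=1.0, rho=0.5):
    # How much larger the SAM coefficient is than the plain-gradient one.
    return sam_weight(margin, x_norm, rho) / sgd_weight(margin)


# Hypothetical margins: a poorly fit example (e.g. a mislabeled point)
# and a well-fit example (a stand-in for a clean point).
low_margin_ratio = upweight_ratio(-2.0)
high_margin_ratio = upweight_ratio(2.0)

print("ratio at margin -2:", low_margin_ratio)
print("ratio at margin +2:", high_margin_ratio)
```

In this sketch the ratio grows with the margin, so examples the model already fits receive a relatively larger share of the gradient than poorly fit ones. This matches the direction of the up-weighting described above only to the extent that clean examples are the ones fit first.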
