Pavel Izmailov, Dmitrii Podoprikhin et al. (2018),
Averaging Weights Leads to Wider Optima and Better Generalization



This work extends the loss-surface work of [Garipov et al.]. Fast Geometric Ensembling (FGE), proposed by Garipov et al., requires n times more computation at test time, where n is the number of models in the ensemble. This work shows that simply averaging multiple points along the trajectory of SGD, run with a cyclical or constant learning rate, leads to better generalization than conventional training. The resulting method, Stochastic Weight Averaging (SWA), approximates FGE with a single model.
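The difference in test-time cost can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the paper's setup: the one-unit tanh model `predict`, the helper names, and the synthetic weight vectors, which are deliberately placed close together. An ensemble runs one forward pass per model, while weight averaging runs a single pass, and for nearby weights the two predictions should differ only slightly.

```python
import math
import random


def predict(w, x):
    """Toy model (an assumption for illustration): a single tanh unit."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))


def ensemble_predict(weights, x):
    """Average the predictions of every model: one forward pass per model."""
    preds = [predict(w, x) for w in weights]
    return sum(preds) / len(preds), len(preds)


def averaged_weights_predict(weights, x):
    """Average the weights first, then run a single forward pass."""
    n = len(weights)
    w_avg = [sum(coords) / n for coords in zip(*weights)]
    return predict(w_avg, x), 1


rng = random.Random(0)
base = [0.3, -0.2, 0.1, 0.4, -0.1]
# Five nearby weight vectors, standing in for points along an SGD trajectory.
weights = [[b + rng.gauss(0.0, 0.01) for b in base] for _ in range(5)]
x = [1.0, -2.0, 0.5, 1.5, -1.0]

ens_pred, ens_passes = ensemble_predict(weights, x)
avg_pred, avg_passes = averaged_weights_predict(weights, x)
print("forward passes:", ens_passes, "vs", avg_passes)
print("prediction gap:", abs(ens_pred - avg_pred))
```

The gap stays small here only because the weight vectors are close to each other; widely separated weights would not be expected to behave this way.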

  • [Keskar et al. 2017] and [Hochreiter et al. 1997] conjectured that the width of an optimum is correlated with generalization.

  • [Mandt et al. 2017]: SGD with a fixed learning rate samples from a Gaussian distribution centered at the minimum of the loss. In high dimensions the SGD iterates concentrate on the surface of an ellipsoid; averaging them moves the solution inside the ellipsoid.
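
The geometric intuition in the last point can be checked numerically. The sketch below is a simplification and not the paper's experiment: it assumes an isotropic Gaussian centered at the origin (so the ellipsoid becomes a sphere), treats independent samples as stand-ins for SGD iterates, and uses hypothetical helper names. Each sample has a norm close to sqrt(dim), while the average of the samples lies much closer to the center.

```python
import math
import random


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def sample_iterates(num_iterates, dim, seed=0):
    """Draw stand-in 'SGD iterates' from an isotropic Gaussian centered at 0."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(num_iterates)]


def average(points):
    """Coordinate-wise mean of a list of equally sized vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


dim = 1000
iterates = sample_iterates(num_iterates=20, dim=dim)
iterate_norms = [norm(w) for w in iterates]
avg_norm = norm(average(iterates))

# Individual samples sit on a thin shell of radius about sqrt(dim).
print("shell radius (sqrt(dim)):", math.sqrt(dim))
print("sample norms range:", min(iterate_norms), max(iterate_norms))
# The average is far closer to the center than any single sample.
print("norm of the average:", avg_norm)
```

Real SGD iterates are correlated and the distribution is generally not isotropic, so this only illustrates the shell effect, not its size in practice.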


The authors found that the observations made by [Mandt et al. 2017] still hold even when a cyclical learning rate is used.

The SWA algorithm:

  • Use a cyclical or constant learning rate
  • Average the weights: 1) cyclical LR: at the end of each cycle, when the learning rate is at its minimum; 2) constant LR: at the end of each epoch
  • Recompute the Batch Normalization statistics for the averaged weights at the end of training
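
The averaging procedure can be sketched on a toy problem. This is a minimal sketch under stated assumptions, not the authors' implementation: the loss is a quadratic with Gaussian gradient noise standing in for minibatch noise, the names `cyclical_lr`, `noisy_grad`, and `swa` are hypothetical, the learning rate decays linearly within each cycle, and the weights are captured when a cycle ends. A constant learning rate corresponds to `lr_max == lr_min` with `cycle_len` set to the number of steps per epoch. The toy model has no Batch Normalization layers, so the final recomputation step is not shown.

```python
import math
import random


def cyclical_lr(step, cycle_len, lr_max, lr_min):
    """Linear decay from lr_max toward lr_min within a cycle, then restart."""
    t = ((step % cycle_len) + 1) / cycle_len
    return (1.0 - t) * lr_max + t * lr_min


def noisy_grad(w, rng, noise=1.0):
    """Gradient of the toy loss 0.5 * ||w||^2 plus Gaussian noise."""
    return [wi + rng.gauss(0.0, noise) for wi in w]


def swa(w_init, num_steps, cycle_len, lr_max=0.1, lr_min=0.01, seed=0):
    """Run noisy SGD and keep a running average of the cycle-end weights."""
    rng = random.Random(seed)
    w = list(w_init)
    w_swa = list(w_init)
    n_models = 0
    for step in range(num_steps):
        lr = cyclical_lr(step, cycle_len, lr_max, lr_min)
        g = noisy_grad(w, rng)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
        if (step + 1) % cycle_len == 0:
            # Running mean over the weights captured so far.
            w_swa = [(ws * n_models + wi) / (n_models + 1)
                     for ws, wi in zip(w_swa, w)]
            n_models += 1
    return w, w_swa, n_models


def norm(v):
    return math.sqrt(sum(x * x for x in v))


w_last, w_avg, n_models = swa([0.5] * 50, num_steps=2000, cycle_len=20)
print("models averaged:", n_models)
# The minimum of the toy loss is at the origin, so a smaller norm is better.
print("distance to minimum, last iterate:", norm(w_last))
print("distance to minimum, SWA average:", norm(w_avg))
```

In this toy setting the averaged weights end up closer to the minimum than the final SGD iterate. For real networks with Batch Normalization, the statistics would still need to be recomputed with a pass over the training data using the averaged weights.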