Pavel Izmailov, Dmitrii Podoprikhin et al. (2018),
Averaging Weights Leads to Wider Optima and Better Generalization
Paper: http://auai.org/uai2018/proceedings/papers/313.pdf
Code: https://github.com/timgaripov/swa
This work is an extension of the loss-surface analysis of [Garipov et al. 2018]. Fast Geometric Ensembling (FGE), proposed in that work, requires n times more computation at test time, n being the number of models in the ensemble. This work shows that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better generalization than conventional training. The resulting Stochastic Weight Averaging (SWA) procedure approximates an FGE ensemble with a single model.

It has been conjectured by [Keskar et al. 2017] and [Hochreiter & Schmidhuber 1997] that the width of an optimum is correlated with generalization.

[Mandt et al. 2017]: SGD with a fixed learning rate samples from a Gaussian distribution centered at the minimum of the loss, so the SGD iterates concentrate on the surface of an ellipsoid around it. Averaging the iterates lets us move inside the ellipsoid, closer to the minimum.
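This picture is easy to check numerically. The following sketch (not from the paper; a toy quadratic loss with artificially injected gradient noise) shows that individual SGD iterates hover on a shell at some distance from the minimum, while their average lands much closer to it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Quadratic loss L(w) = 0.5 * w^T A w, minimum at the origin.
A = np.array([[3.0, 0.0], [0.0, 0.5]])
lr = 0.1
w = np.array([2.0, 2.0])

iterates = []
for _ in range(5000):
    grad = A @ w + rng.normal(scale=0.5, size=2)  # noisy gradient
    w = w - lr * grad
    iterates.append(w.copy())

tail = np.array(iterates[1000:])   # discard burn-in
w_avg = tail.mean(axis=0)          # average of the SGD iterates

# Typical iterate distance from the minimum vs. distance of the average:
print(np.linalg.norm(tail, axis=1).mean())  # stays bounded away from 0
print(np.linalg.norm(w_avg))                # much smaller: inside the ellipsoid
```

The contrast between the two printed distances is the whole point: averaging cancels the Gaussian noise around the minimum that individual iterates carry.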
Method
The authors found that the observations of [Mandt et al. 2017] still hold even when a cyclical learning rate is used.
The SWA algorithm:
 Train with a cyclical or constant learning rate
 Average the weights: 1) cyclical LR: at the end of each cycle 2) constant LR: at the end of each epoch
 Recompute Batch Normalization statistics at the end of training (the running statistics collected during training do not match the averaged weights)
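The steps above can be sketched in NumPy. This is an illustration, not the paper's implementation: a toy linear model stands in for a deep network, the linear cyclical schedule and all hyperparameters are assumptions, and the batch-norm step is only marked by a comment since the toy model has no such layers:

```python
import numpy as np

def cyclical_lr(step, cycle_len, lr_max, lr_min):
    # Linearly anneal from lr_max down to lr_min within each cycle (assumed schedule).
    t = (step % cycle_len) / max(cycle_len - 1, 1)
    return (1 - t) * lr_max + t * lr_min

# Toy stand-in for a network: linear regression trained with minibatch SGD.
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 5))
y = X @ np.ones(5) + 0.1 * rng.normal(size=256)

def sgd_step(w, lr):
    i = rng.integers(0, len(X), size=32)            # sample a minibatch
    grad = X[i].T @ (X[i] @ w - y[i]) / len(i)
    return w - lr * grad

w = np.zeros(5)
for step in range(100):                             # conventional pretraining phase
    w = sgd_step(w, lr=0.05)

w_swa, n_models = w.copy(), 1                       # SWA starts from the pretrained weights
cycle_len = 10
for step in range(100):                             # SWA phase with a cyclical LR
    w = sgd_step(w, cyclical_lr(step, cycle_len, lr_max=0.05, lr_min=0.005))
    if step % cycle_len == cycle_len - 1:           # end of each cycle: update running average
        w_swa = (w_swa * n_models + w) / (n_models + 1)
        n_models += 1

# For a real network, Batch Normalization statistics would be recomputed here
# by running forward passes over the training data with the w_swa weights.
print(w_swa)                                        # close to the true weights (all ones)
```

The running-average update is the same constant-memory trick used in the released code: only `w_swa` and a counter are kept, so SWA costs no more at test time than a single model.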