Timur Garipov, Pavel Izmailov et al. (2018)
Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs
Paper: https://arxiv.org/pdf/1802.10026.pdf
Code: https://github.com/timgaripov/dnnmodeconnectivity
This paper presents a very interesting empirical result: in the weight space of neural networks, the minima (modes) found by independent training runs can be connected by a path along which the loss stays nearly constant. This matters because it lets us build an ensemble of models with the same accuracy but different parameters. The straight line segment between two modes has high training loss, which is why modes in loss surfaces were previously believed to be isolated.
Finding paths between modes
Let $\hat{w}_1$ and $\hat{w}_2$ in $\mathbb{R}^{net}$ be two sets of weights corresponding to two neural networks trained independently with a loss $\mathcal{L}(w)$, and let $\phi_{\theta} : [0,1] \rightarrow \mathbb{R}^{net}$ be a continuous, piecewise-smooth parametric curve with parameters $\theta$, such that $\phi_{\theta}(0) = \hat{w}_1$ and $\phi_{\theta}(1) = \hat{w}_2$.
\begin{align}
\hat{\ell}(\theta) = \frac{\int \mathcal{L}(\phi_{\theta}) \, d\phi_{\theta}}{\int d\phi_{\theta}} = \frac{\int_{0}^{1} \mathcal{L}(\phi_{\theta}(t)) \left\| \phi_{\theta}^{\prime}(t) \right\| dt}{\int_{0}^{1} \left\| \phi_{\theta}^{\prime}(t) \right\| dt} = \int_{0}^{1} \mathcal{L}(\phi_{\theta}(t)) \, q_{\theta}(t) \, dt = \mathbb{E}_{t \sim q_{\theta}(t)} \left[ \mathcal{L}(\phi_{\theta}(t)) \right]
\end{align}
where the distribution $q_{\theta}(t)$ on $t \in [0,1]$ is defined as $q_{\theta}(t) = \left\| \phi_{\theta}^{\prime}(t) \right\| \cdot \left( \int_{0}^{1} \left\| \phi_{\theta}^{\prime}(t) \right\| dt \right)^{-1}$
The curve is found by minimizing the expected loss above: at each iteration a value $t$ is sampled uniformly from $[0,1]$ and a gradient step is made for $\theta$ with respect to the loss $\mathcal{L}(\phi_{\theta}(t))$.
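A minimal NumPy sketch of this sampling procedure, with a toy quadratic loss standing in for a real network loss and a polygonal chain with a single bend as the curve (the loss, dimensions, and learning rate are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy quadratic stand-in for a network's training loss (an assumption;
# the paper uses the actual loss of the network being studied).
center = np.array([0.5, -0.5])

def loss(w):
    return float(np.sum((w - center) ** 2))

w1 = np.array([0.0, 0.0])  # pretend these are two independently trained minima
w2 = np.array([1.0, 1.0])
theta = 0.5 * (w1 + w2)    # initialize the bend at the midpoint of the segment

def phi(t, theta):
    # Polygonal chain with one bend: phi(0)=w1, phi(0.5)=theta, phi(1)=w2.
    if t <= 0.5:
        return 2 * ((0.5 - t) * w1 + t * theta)
    return 2 * ((t - 0.5) * w2 + (1 - t) * theta)

lr = 0.1
for _ in range(500):
    t = rng.uniform()          # sample t ~ U[0, 1]
    w = phi(t, theta)
    # Chain rule: d loss / d theta = (d loss / d phi) * (d phi / d theta),
    # where d phi / d theta = 2t for t <= 0.5 and 2(1 - t) otherwise.
    dphi_dtheta = 2 * t if t <= 0.5 else 2 * (1 - t)
    theta = theta - lr * 2 * (w - center) * dphi_dtheta
```

Because only the bend $\theta$ is trained, the endpoints stay fixed at the two minima while the average loss along the chain decreases.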
The curve is parametrized either as a polygonal chain with a single bend or as a quadratic Bezier curve. High-accuracy paths connecting different modes were found across a range of architectures and datasets.
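As a sketch of the Bezier parametrization, here is a quadratic Bezier curve in a toy two-dimensional "weight space" (the endpoint values are illustrative assumptions):

```python
import numpy as np

def bezier(t, w1, theta, w2):
    # Quadratic Bezier curve: endpoints fixed at the two trained solutions
    # w1 and w2, with a single trainable control point theta.
    return (1 - t) ** 2 * w1 + 2 * t * (1 - t) * theta + t ** 2 * w2

w1 = np.array([0.0, 0.0])      # toy "trained weights" (assumption)
w2 = np.array([1.0, 1.0])
theta = np.array([1.0, 0.0])   # control point, optimized in practice

# The endpoints are preserved for any theta, so the curve
# always connects the two modes.
print(bezier(0.0, w1, theta, w2))  # -> [0. 0.]
print(bezier(1.0, w1, theta, w2))  # -> [1. 1.]
```

The same endpoint property holds for the polygonal chain, which is why only the bend/control point needs to be trained.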
Comments

A new ensembling approach, Fast Geometric Ensembling (FGE), is proposed: diverse networks can be found with relatively small steps in weight space, without leaving the region of low test error.
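A minimal sketch of the ensembling step: average the softmax outputs of several nearby snapshots. Here the "snapshots" are fake linear models obtained by perturbing one weight matrix; in FGE they would be collected along a cyclical learning-rate schedule, so everything below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical snapshots: small perturbations of one weight matrix, standing
# in for the nearby-but-diverse networks FGE collects during training.
base_w = rng.normal(size=(4, 3))
snapshots = [base_w + 0.1 * rng.normal(size=base_w.shape) for _ in range(5)]

x = rng.normal(size=(2, 4))  # a toy batch of 2 inputs with 4 features

# Ensemble prediction: average the softmax outputs of all snapshots.
probs = np.mean([softmax(x @ w) for w in snapshots], axis=0)
```

Averaging probabilities (rather than weights) keeps each row a valid distribution over the 3 toy classes.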

Test prediction with an ensemble of $n$ models requires $n$ times more computation than with a single model.

An ensemble can be trained with this new ensembling approach in the time required to train a single network, and the connecting curve can also be found with a comparable amount of training.

These insights are very interesting, and it is surprising that modes can be connected by such simple curves.