Timur Garipov, Pavel Izmailov et al. (2018)
Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs
https://arxiv.org/pdf/1802.10026.pdf

This paper reports a striking empirical result: in the weight space of a neural network, independently found minima (modes) can be connected by a simple path along which the loss stays nearly constant. This matters because the points along such a path are models with similar accuracy but different parameters, so they can be combined into an ensemble. The straight line segment between two modes typically passes through a region of high training loss, which is why the modes of the loss surface were previously believed to be isolated.

### Finding paths between modes

Let $\hat{w}_1$ and $\hat{w}_2$ in $\mathbb{R}^{|net|}$ be two sets of weights corresponding to two neural networks trained independently with a loss $\mathcal{L}(w)$, and let $\phi_\theta : [0,1] \rightarrow \mathbb{R}^{|net|}$ be a continuous piecewise smooth parametric curve with parameters $\theta$ such that $\phi_\theta(0) = \hat{w}_1$ and $\phi_\theta(1) = \hat{w}_2$. To find a low-loss path between the two modes, the paper considers the average of the loss along the curve:

\begin{align}
\hat{\ell}(\theta) &= \frac{\int \mathcal{L}(\phi_\theta) \, d\phi_\theta}{\int d\phi_\theta} = \frac{\int_0^1 \mathcal{L}(\phi_\theta(t)) \lVert \phi_\theta'(t) \rVert \, dt}{\int_0^1 \lVert \phi_\theta'(t) \rVert \, dt} \\
&= \int_0^1 \mathcal{L}(\phi_\theta(t)) \, q_\theta(t) \, dt = \mathbb{E}_{t \sim q_\theta(t)} \left[ \mathcal{L}(\phi_\theta(t)) \right]
\end{align}

where the distribution $q_\theta(t)$ on $t \in [0,1]$ is defined as $q_\theta(t) = \lVert \phi_\theta'(t) \rVert \cdot \left( \int_0^1 \lVert \phi_\theta'(t) \rVert \, dt \right)^{-1}$.
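To make the arc-length weighting concrete, here is a minimal numerical sketch of the path-averaged loss. Everything in it is an illustrative assumption rather than the paper's setup: `toy_loss` is a hypothetical 2-D loss whose minima lie on the unit circle, the curve is a quadratic Bezier curve with one bend, and the integrals are approximated with a hand-written trapezoidal rule.

```python
import numpy as np

def toy_loss(w):
    # Hypothetical 2-D loss (not from the paper): zero on the unit circle.
    return (np.sum(w ** 2, axis=-1) - 1.0) ** 2

def bezier(ts, w1, theta, w2):
    # Quadratic Bezier curve from w1 to w2 with bend point theta.
    ts = ts[:, None]
    return (1 - ts) ** 2 * w1 + 2 * ts * (1 - ts) * theta + ts ** 2 * w2

def bezier_velocity(ts, w1, theta, w2):
    # Derivative of the curve with respect to t.
    ts = ts[:, None]
    return 2 * (1 - ts) * (theta - w1) + 2 * ts * (w2 - theta)

def trapezoid(y, x):
    # Trapezoidal rule on a 1-D grid.
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def path_density(ts, w1, theta, w2):
    # q_theta(t): speed of the curve divided by its total length.
    speed = np.linalg.norm(bezier_velocity(ts, w1, theta, w2), axis=1)
    return speed / trapezoid(speed, ts)

def path_average_loss(ts, w1, theta, w2):
    # Expectation of the loss under q_theta, i.e. the arc-length average.
    q = path_density(ts, w1, theta, w2)
    return trapezoid(toy_loss(bezier(ts, w1, theta, w2)) * q, ts)

w1 = np.array([-1.0, 0.0])
w2 = np.array([1.0, 0.0])
ts = np.linspace(0.0, 1.0, 2001)

# With the bend at the midpoint the Bezier curve is the straight segment.
ell_straight = path_average_loss(ts, w1, 0.5 * (w1 + w2), w2)
# A bend pushed away from the origin keeps the curve near the unit circle.
ell_bent = path_average_loss(ts, w1, np.array([0.0, 2.0]), w2)
print(ell_straight, ell_bent)
```

In this toy setting the straight segment crosses the high-loss region around the origin, while the bent curve stays close to the circle of minima, so its path-averaged loss is much smaller.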

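Because $q_\theta(t)$ itself depends on $\theta$, the gradient of $\hat{\ell}(\theta)$ is hard to compute directly. The paper therefore minimizes the simpler loss $\ell(\theta) = \mathbb{E}_{t \sim U(0,1)} \left[ \mathcal{L}(\phi_\theta(t)) \right]$, in which $t$ is drawn uniformly from $[0,1]$: at each iteration, sample $t$ and take a gradient step for $\theta$ with respect to the loss $\mathcal{L}(\phi_\theta(t))$.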
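The per-iteration update can be sketched on a toy problem. The code below is not the paper's implementation: it assumes a hypothetical 2-D loss with minima on the unit circle, a quadratic Bezier curve with a single bend $\theta$, a value of $t$ drawn uniformly from $[0,1]$ at every step, and a hand-derived gradient. The bend is initialized slightly off the midpoint of the two endpoints, because in this symmetric toy loss the exact midpoint gives no gradient signal in the direction the bend needs to move.

```python
import numpy as np

def loss(w):
    # Hypothetical 2-D loss (not from the paper): zero on the unit circle.
    return (np.dot(w, w) - 1.0) ** 2

def loss_grad(w):
    # Gradient of the toy loss with respect to w.
    return 4.0 * (np.dot(w, w) - 1.0) * w

def bezier(t, w1, theta, w2):
    # Quadratic Bezier curve from w1 to w2 with bend point theta.
    return (1 - t) ** 2 * w1 + 2 * t * (1 - t) * theta + t ** 2 * w2

def train_bend(w1, w2, steps=3000, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    # Assumption: start near the midpoint, nudged off the symmetry axis.
    theta = 0.5 * (w1 + w2) + np.array([0.0, 0.1])
    for _ in range(steps):
        t = rng.uniform(0.0, 1.0)        # sample a point on the curve
        w = bezier(t, w1, theta, w2)     # weights at that point
        # Chain rule: d(phi)/d(theta) = 2 t (1 - t) times the identity.
        theta = theta - lr * 2 * t * (1 - t) * loss_grad(w)
    return theta

def mean_loss(w1, theta, w2, n=201):
    # Average loss over a uniform grid of t values.
    ts = np.linspace(0.0, 1.0, n)
    return float(np.mean([loss(bezier(t, w1, theta, w2)) for t in ts]))

w1 = np.array([-1.0, 0.0])
w2 = np.array([1.0, 0.0])
theta = train_bend(w1, w2)

# A bend at the midpoint reproduces the straight segment between the modes.
straight_mean = mean_loss(w1, 0.5 * (w1 + w2), w2)
curve_mean = mean_loss(w1, theta, w2)
print(theta, straight_mean, curve_mean)
```

Only the bend $\theta$ is updated; the endpoints stay fixed throughout, which matches the constraint that the curve starts at $\hat{w}_1$ and ends at $\hat{w}_2$. In a real network the hand-written gradient would be replaced by backpropagation through $\phi_\theta(t)$ on a mini-batch.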

Two parametrizations of the curve are used: a polygonal chain with a single bend and a Bezier curve. With either one, high-accuracy paths connecting different modes were found across a range of architectures and datasets.
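For reference, the two parametrizations can be written out explicitly. The forms below are a sketch for the single-bend case, with $\theta \in \mathbb{R}^{|net|}$ denoting the bend point and the endpoints held fixed; the exact scaling of $t$ is worth checking against the paper.

Polygonal chain with one bend:

\begin{align}
\phi_\theta(t) =
\begin{cases}
2 \left( t \theta + (0.5 - t) \hat{w}_1 \right), & 0 \le t \le 0.5 \\
2 \left( (t - 0.5) \hat{w}_2 + (1 - t) \theta \right), & 0.5 \le t \le 1
\end{cases}
\end{align}

Quadratic Bezier curve:

\begin{align}
\phi_\theta(t) = (1 - t)^2 \hat{w}_1 + 2 t (1 - t) \theta + t^2 \hat{w}_2, \quad 0 \le t \le 1
\end{align}

Both satisfy $\phi_\theta(0) = \hat{w}_1$ and $\phi_\theta(1) = \hat{w}_2$. The chain passes through $\theta$ at $t = 0.5$, whereas the Bezier curve is only pulled toward it: $\phi_\theta(0.5) = (\hat{w}_1 + 2\theta + \hat{w}_2) / 4$.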