Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, Daan Wierstra (2015)
Weight Uncertainty in Neural Networks
https://arxiv.org/pdf/1505.05424.pdf

When deep learning is used in sensitive domains such as healthcare, we should question not just the accuracy but also the confidence of the models being deployed. To incorporate this notion of confidence, or uncertainty, this paper proposes an architecture called the Bayesian neural network and introduces a training algorithm called Bayes by Backprop.

Architecture

[Figure 1: bnn]

  • To treat a neural network in a probabilistic manner, we represent each of its weights using a distribution rather than the single numeric value that is commonly used. For example, as shown in Figure 1, each weight can be represented as a Gaussian random variable.
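As a minimal sketch of what "a weight as a Gaussian random variable" can look like in code, the snippet below stores a mean and a spread for a single weight and draws a fresh value on every call. The class name `GaussianWeight`, the parameter name `rho`, and the choice of a softplus to keep the standard deviation positive are assumptions made for illustration, not something this summary specifies.

```python
import math
import random


class GaussianWeight:
    """One network weight represented as a Gaussian (hypothetical helper)."""

    def __init__(self, mu, rho):
        self.mu = mu    # mean of the weight distribution
        self.rho = rho  # unconstrained parameter controlling the spread

    @property
    def sigma(self):
        # Softplus maps any real rho to a strictly positive standard deviation.
        return math.log1p(math.exp(self.rho))

    def sample(self, rng):
        # Draw standard normal noise, then shift and scale it.
        eps = rng.gauss(0.0, 1.0)
        return self.mu + self.sigma * eps


rng = random.Random(0)
weight = GaussianWeight(mu=0.5, rho=-2.0)
draws = [weight.sample(rng) for _ in range(5000)]
sample_mean = sum(draws) / len(draws)
```

Each call to `sample` yields a different concrete weight, so sampling every weight of a network once gives one concrete network drawn from the distribution.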

Training and Loss function

  • The normal approach to training neural networks by updating weights using gradient descent seeks to find the weights which best explain the data. This can be seen as learning the weights which maximize the likelihood $P(\mathcal{D} \vert \mathbf{w})$, through Maximum Likelihood Estimation (MLE).

  • This paper instead takes a Bayesian approach and infers the posterior distribution of the weights given the data. However, in most cases the exact posterior is intractable.

  • To address this issue, this paper uses variational inference, where the posterior $P(\mathbf{w} | \mathcal{D})$ is approximated by a tractable distribution $q(\mathbf{w} | \theta)$ on the weights, commonly referred to as the variational posterior. We look for the parameters $\theta$ that minimize the KL divergence between this distribution and the true posterior. Formally, this can be expressed as:
    \begin{align}
    \theta^* &= \arg \min_\theta \text{KL}[q(\mathbf{w} | \theta)\ ||\ P(\mathbf{w} | \mathcal{D})]
    \end{align}

    which can be simplified as minimizing the following cost function.

\begin{align}
\mathcal{F}(\mathcal{D}, \mathbf{\theta}) = \text{KL}[q(\mathbf{w}\ |\ \mathbf{\theta})\ ||\ P(\mathbf{w})] - \mathbb{E}_{q(\mathbf{w}\ |\ \mathbf{\theta})}[\log P(\mathcal{D}\ |\ \mathbf{w})]
\end{align}

  • The cost function balances two terms: the expected log-likelihood rewards weights that explain the data $\mathcal{D}$, however complex, while the KL term keeps the variational posterior close to the simple prior $P(\mathbf{w})$.
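As a sketch of why the KL divergence to the true posterior reduces to the cost function above, one can expand the posterior with Bayes' rule, $P(\mathbf{w} | \mathcal{D}) = P(\mathcal{D} | \mathbf{w}) P(\mathbf{w}) / P(\mathcal{D})$. The integral notation assumes continuous weights:

\begin{align}
\text{KL}[q(\mathbf{w} | \theta)\ ||\ P(\mathbf{w} | \mathcal{D})]
&= \int q(\mathbf{w} | \theta) \log \frac{q(\mathbf{w} | \theta)}{P(\mathbf{w} | \mathcal{D})} \, d\mathbf{w} \\
&= \int q(\mathbf{w} | \theta) \log \frac{q(\mathbf{w} | \theta) \, P(\mathcal{D})}{P(\mathbf{w}) \, P(\mathcal{D} | \mathbf{w})} \, d\mathbf{w} \\
&= \text{KL}[q(\mathbf{w} | \theta)\ ||\ P(\mathbf{w})] - \mathbb{E}_{q(\mathbf{w} | \theta)}[\log P(\mathcal{D} | \mathbf{w})] + \log P(\mathcal{D})
\end{align}

The term $\log P(\mathcal{D})$ does not depend on $\theta$, so dropping it does not change the minimizer and leaves exactly $\mathcal{F}(\mathcal{D}, \theta)$.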

Backpropagation-based approximation for training

  • Calculating the expectation of the likelihood over the variational posterior exactly is computationally prohibitive, so we rely on an approximation. We approximate the cost function using sampled weights:

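\begin{align}
\mathcal{F}(\mathcal{D}, \mathbf{\theta}) \approx \sum_{i=1}^N \left[ \log q(\mathbf{w}^{(i)}\ |\ \mathbf{\theta}) - \log P(\mathbf{w}^{(i)}) - \log P(\mathcal{D}\ |\ \mathbf{w}^{(i)}) \right],
\end{align}
where $\mathbf{w}^{(i)}$ are sampled from the variational posterior.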

  • When using automatic differentiation as provided by frameworks such as PyTorch, we only need to worry about implementing this sampling, and setting up the cost function as above, and can leverage our usual backpropagation methods to train a model.
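The sampled cost above can be sketched in plain Python for a toy model with a single weight. Everything specific here is an assumption for illustration: the model $y = w x$ with Gaussian noise, the zero-mean Gaussian prior, the softplus parameterization of the standard deviation, and the names `mc_cost`, `prior_std`, and `noise_std`. The function also divides the sum by the number of samples, so it returns a per-sample average rather than the raw sum. In a framework with automatic differentiation, the same computation would be written with tensors so that gradients with respect to `mu` and `rho` come from backpropagation.

```python
import math
import random


def softplus(x):
    # Maps an unconstrained value to a positive standard deviation.
    return math.log1p(math.exp(x))


def log_normal_pdf(x, mean, std):
    # Log density of a univariate Gaussian.
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2


def mc_cost(mu, rho, xs, ys, n_samples=100, prior_std=1.0, noise_std=0.5, rng=None):
    """Average of log q(w) - log P(w) - log P(D | w) over sampled weights."""
    if rng is None:
        rng = random.Random(0)
    sigma = softplus(rho)
    total = 0.0
    for _ in range(n_samples):
        # Sample a weight from the variational posterior N(mu, sigma^2).
        w = mu + sigma * rng.gauss(0.0, 1.0)
        log_q = log_normal_pdf(w, mu, sigma)
        log_prior = log_normal_pdf(w, 0.0, prior_std)
        # Log-likelihood of the whole dataset under the toy model y = w * x.
        log_lik = sum(log_normal_pdf(y, w * x, noise_std) for x, y in zip(xs, ys))
        total += log_q - log_prior - log_lik
    return total / n_samples


# Toy data generated by a true weight of 2.0 (no noise, for simplicity).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x for x in xs]

cost_near_truth = mc_cost(2.0, -3.0, xs, ys, rng=random.Random(0))
cost_far_from_truth = mc_cost(-2.0, -3.0, xs, ys, rng=random.Random(0))
```

A posterior centered near the weight that generated the data should give a lower cost than one centered far from it, which is the signal a gradient-based optimizer would follow.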

Key advantages

Model Ensembling

  • Having learned distributions over the weights of our model, we effectively have an infinite ensemble of neural networks, which can improve accuracy. We can exploit this by combining the outputs from different samples of the model weights.
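Combining outputs from different weight samples can be sketched as a simple average of predicted probabilities. The example below assumes a tiny logistic model whose weights each have an already-learned Gaussian mean and standard deviation; the numbers and the names `sample_weights` and `ensemble_predict` are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sample_weights(means, stds, rng):
    # One draw per weight gives one member of the ensemble.
    return [rng.gauss(m, s) for m, s in zip(means, stds)]


def ensemble_predict(x, means, stds, n_samples=200, rng=None):
    """Average the predicted probability over many sampled weight vectors."""
    if rng is None:
        rng = random.Random(0)
    probs = []
    for _ in range(n_samples):
        weights = sample_weights(means, stds, rng)
        score = sum(w * xi for w, xi in zip(weights, x))
        probs.append(sigmoid(score))
    return sum(probs) / len(probs)


# Hypothetical learned posterior over two weights.
posterior_means = [2.0, -1.0]
posterior_stds = [0.1, 0.1]
avg_prob = ensemble_predict([1.0, 0.0], posterior_means, posterior_stds, rng=random.Random(0))
```

Averaging probabilities is only one way to combine the samples; majority voting over predicted classes is another common choice.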

Measuring Uncertainty

  • One can measure uncertainty as follows: sample the weights multiple times, do a forward pass on the same image for each sample, and plot the histogram of the individual predictions.
  • If the model consistently predicts the same output in each forward pass, it is confident about that prediction. Otherwise, it is uncertain about it.
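The procedure above can be sketched with a toy binary classifier in place of an image model. The posterior means and standard deviations, the two example inputs, and the helper names `prediction_histogram` and `agreement` are all assumptions for illustration; a `Counter` stands in for the plotted histogram.

```python
import random
from collections import Counter


def sample_weights(means, stds, rng):
    # Draw one concrete weight vector from the per-weight Gaussians.
    return [rng.gauss(m, s) for m, s in zip(means, stds)]


def predict_class(weights, x):
    # Toy linear classifier: class 1 if the score is positive, else class 0.
    score = sum(w * xi for w, xi in zip(weights, x))
    return 1 if score > 0 else 0


def prediction_histogram(x, means, stds, n_passes=500, rng=None):
    """Count the predicted class over repeated forward passes on one input."""
    if rng is None:
        rng = random.Random(0)
    return Counter(
        predict_class(sample_weights(means, stds, rng), x) for _ in range(n_passes)
    )


def agreement(hist):
    # Fraction of forward passes that voted for the most common class.
    return max(hist.values()) / sum(hist.values())


means, stds = [1.0, -1.0], [0.5, 0.5]
clear_hist = prediction_histogram([3.0, -3.0], means, stds, rng=random.Random(0))
ambiguous_hist = prediction_histogram([1.0, 1.0], means, stds, rng=random.Random(1))
```

An agreement close to 1 corresponds to a histogram concentrated on one class (a confident prediction), while an agreement near 0.5 in this two-class setting means the sampled networks disagree.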

Experimental Results

  • The proposed network is trained on the MNIST digit dataset with a simple two-layer MLP. The improvement in accuracy over a standard neural network is similar to the improvement one would get from using dropout. In other words, weight uncertainty may be treated as one possible alternative to dropout.

  • This concept of weight uncertainty is also applied to the classic reinforcement learning problem of contextual bandits. It is shown that Bayes by Backprop can automatically learn how to trade off exploration and exploitation.