Y Gal, Z Ghahramani (2015)

Dropout as a Bayesian Approximation: Insights and Applications

https://arxiv.org/pdf/1506.02142.pdf

The crux of Bayesian methods is treating the parameters of the model as distributions rather than point estimates. This point of view gives us an uncertainty measure for predictions.

### Background

*Predictive posterior distribution*: Given inputs $X$, targets $Y$, parameters $\theta$, and a new input point $x^{*}$, the predictive posterior distribution is

$$ p(y^* | x^*, X, Y) = \int p(y^* | x^*, \theta) \, p(\theta | X, Y) \, d\theta$$

The distribution $p(\theta | X, Y)$ is called the posterior distribution over the parameters, and it cannot be evaluated analytically for complex models such as neural networks. So a distribution $q(\theta)$ is chosen from a family of "nice" distributions, and the KL divergence $KL(q(\theta) \,||\, p(\theta | X,Y))$ is minimized so that $q(\theta)$ becomes close to the posterior.

Another important result is that minimizing this KL divergence is equivalent to maximizing the log evidence lower bound,

$$L_{VI} := \int q(\theta) \log p(Y | X, \theta) \, d\theta - KL(q(\theta) \,||\, p(\theta))$$

with respect to the variational parameters defining $q(\theta)$.
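The equivalence is worth spelling out in one line (a standard variational-inference identity, stated here for completeness). Applying Bayes' rule $p(\theta | X, Y) = p(Y | X, \theta)\,p(\theta) / p(Y | X)$ inside the KL term gives

$$KL(q(\theta) \,||\, p(\theta | X, Y)) = \log p(Y | X) - L_{VI}$$

Since the log evidence $\log p(Y | X)$ does not depend on $q(\theta)$, minimizing the KL divergence on the left is the same as maximizing $L_{VI}$.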

### Algorithm

The authors prove that using dropout at test time performs approximate Bayesian inference and provides good uncertainty estimates. A Bernoulli distribution is chosen as the variational distribution $q(\theta)$.

At test time, use Monte Carlo (MC) integration with $T$ samples:

$$ p(y^* | x^*, X, Y) \approx \frac{1}{T} \sum_{i=1}^{T} p(y^* | x^*, \hat{\theta}_{i})$$ with $\hat{\theta}_{i} \sim q(\theta)$.

Hence this is named MC dropout.
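The procedure above can be sketched in a few lines of NumPy. This is a toy illustration with made-up weights and a made-up input, not the paper's actual experiments: dropout is kept on at test time, the network is run $T$ times, and the mean and standard deviation of the stochastic outputs serve as the prediction and its uncertainty estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hidden-layer regression network with fixed (hypothetical) weights.
W1 = rng.normal(size=(1, 50))
b1 = np.zeros(50)
W2 = rng.normal(size=(50, 1))
b2 = np.zeros(1)

def forward(x, p_drop=0.5):
    """One stochastic forward pass: dropout stays ON at test time."""
    h = np.maximum(x @ W1 + b1, 0.0)      # ReLU hidden layer
    mask = rng.random(h.shape) > p_drop   # Bernoulli dropout mask
    h = h * mask / (1.0 - p_drop)         # inverted-dropout scaling
    return h @ W2 + b2

def mc_dropout_predict(x, T=100):
    """Run T stochastic passes; mean is the prediction, std the uncertainty."""
    samples = np.stack([forward(x) for _ in range(T)])
    return samples.mean(axis=0), samples.std(axis=0)

x_star = np.array([[0.5]])          # hypothetical new input point
mean, std = mc_dropout_predict(x_star)
```

Because the dropout masks differ across the $T$ passes, the standard deviation is strictly positive; inputs the network is less sure about tend to produce a larger spread.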

### Comments

- The paper is well written and provides an alternative explanation for dropout's robustness to overfitting.
- Although the proofs in the paper make sense, the result is not entirely convincing: in the infinite-data limit the posterior distribution has to shrink to a point, thereby reducing variance, but Bernoulli distributions have fixed variance (determined by the chosen dropout probability).