Y Gal, Z Ghahramani (2015)
Dropout as a Bayesian Approximation: Insights and Applications
https://arxiv.org/pdf/1506.02142.pdf

The crux of Bayesian methods is treating the parameters of the model as distributions rather than point estimates. This point of view yields an uncertainty measure for predictions.

Background

Predictive posterior distribution: given inputs $X$, targets $Y$, parameters $\theta$, and a new input point $x^{*}$, the predictive posterior distribution is

$$p(y^* | x^*, X, Y) = \int p(y^* | x^*, \theta) \, p(\theta | X, Y) \, d\theta$$

The distribution $p(\theta | X, Y)$ is called the posterior distribution over the parameters, and it cannot be evaluated analytically for complex models such as neural networks. So a distribution $q(\theta)$ is chosen from a family of "nice" (tractable) distributions, and the KL divergence $KL(q(\theta) \,\|\, p(\theta | X,Y))$ is minimized so that $q(\theta)$ gets close to the posterior.
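To make the "choose $q$ from a nice family and minimize KL" idea concrete, here is a toy NumPy sketch (not from the paper). It assumes a made-up univariate Gaussian "posterior" so the KL has a closed form, and finds the variational parameters of a Gaussian $q$ by simple grid search; in practice the posterior is intractable and the KL is minimized indirectly via the evidence lower bound below.

```python
import numpy as np

def kl_gauss(mu_q, s_q, mu_p, s_p):
    """Closed-form KL(q || p) for two univariate Gaussians."""
    return np.log(s_p / s_q) + (s_q**2 + (mu_q - mu_p)**2) / (2 * s_p**2) - 0.5

# Hypothetical "posterior" -- in real models this is not available analytically.
mu_p, s_p = 1.0, 2.0

# Grid search over the variational parameters (mu, sigma) of q.
mus = np.linspace(-3.0, 3.0, 61)
sigmas = np.linspace(0.1, 4.0, 40)
best = min((kl_gauss(m, s, mu_p, s_p), m, s) for m in mus for s in sigmas)
# best = (KL value, mu, sigma); the search recovers q ~= the posterior,
# i.e. (mu, sigma) near (1.0, 2.0) with KL near 0.
```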

Another important result is that minimizing this KL divergence is equivalent to maximizing the log evidence lower bound,

$$L_{VI} := \int q(\theta) \log p(Y | X, \theta) \, d\theta - KL(q(\theta) \,\|\, p(\theta))$$

with respect to the variational parameters defining $q(\theta)$.
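The equivalence can be seen by expanding the KL divergence to the posterior with Bayes' rule, $p(\theta | X, Y) = p(Y | X, \theta) \, p(\theta) / p(Y | X)$:

$$\begin{aligned}
KL(q(\theta) \,\|\, p(\theta | X, Y))
  &= \int q(\theta) \log \frac{q(\theta)}{p(\theta | X, Y)} \, d\theta \\
  &= \int q(\theta) \log \frac{q(\theta) \, p(Y | X)}{p(Y | X, \theta) \, p(\theta)} \, d\theta \\
  &= -\int q(\theta) \log p(Y | X, \theta) \, d\theta + KL(q(\theta) \,\|\, p(\theta)) + \log p(Y | X) \\
  &= -L_{VI} + \log p(Y | X)
\end{aligned}$$

Since the log evidence $\log p(Y | X)$ does not depend on $q$, minimizing the KL divergence is the same as maximizing $L_{VI}$.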

Algorithm

The authors proved that using dropout at test time performs approximate Bayesian inference and provides good uncertainty estimates. A Bernoulli distribution is chosen as the variational distribution $q(\theta)$.

At test time, the predictive distribution is approximated by Monte Carlo (MC) integration with $T$ samples:

$$p(y^* | x^*, X, Y) \approx \frac{1}{T} \sum_{i=1}^{T} p(y^* | x^*, \hat{\theta}_{i})$$ with $\hat{\theta}_{i} \sim q(\theta)$.

Hence this is named MC dropout.
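The procedure can be sketched in plain NumPy: keep the Bernoulli dropout masks active at prediction time, run $T$ stochastic forward passes, and use their mean as the prediction and their spread as an uncertainty estimate. The network here (one hidden layer, random fixed weights, input point) is made up for illustration; it is not the architecture from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hidden-layer network with fixed (hypothetical) weights.
W1 = rng.normal(size=(1, 50))
b1 = np.zeros(50)
W2 = rng.normal(size=(50, 1))
b2 = np.zeros(1)

def forward(x, keep_prob=0.9):
    """One stochastic pass: dropout stays ON, even at test time."""
    h = np.maximum(0.0, x @ W1 + b1)           # ReLU hidden layer
    mask = rng.random(h.shape) < keep_prob     # Bernoulli dropout mask
    h = h * mask / keep_prob                   # inverted-dropout scaling
    return h @ W2 + b2

def mc_dropout_predict(x, T=100):
    """Average T stochastic passes; their spread estimates uncertainty."""
    samples = np.stack([forward(x) for _ in range(T)])
    return samples.mean(axis=0), samples.std(axis=0)

x = np.array([[0.5]])
mean, std = mc_dropout_predict(x, T=200)
```

Note that a standard (non-Bayesian) network would disable dropout at test time; keeping it on and averaging is exactly what the MC integration above prescribes.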