Y Gal, Z Ghahramani (2015)
Dropout as a Bayesian Approximation: Insights and Applications
https://arxiv.org/pdf/1506.02142.pdf

The crux of Bayesian methods is treating the parameters of the model as distributions rather than point estimates. This point of view gives us an uncertainty measure for predictions.

### Background

Predictive posterior distribution: Given inputs $X$, targets $Y$, parameters $\theta$, and a new input point $x^{*}$, the predictive posterior distribution is

$$p(y^* | x^*, X, Y) = \int p(y^* | x^*, \theta) \, p(\theta | X, Y) \, d\theta$$

The distribution $p(\theta | X, Y)$ is called the posterior distribution over parameters, and it cannot be evaluated analytically for complex models such as neural networks. So a distribution $q(\theta)$ is chosen from a family of "nice" (tractable) distributions, and the KL divergence $KL(q(\theta) \,\|\, p(\theta | X,Y))$ is minimized so that $q(\theta)$ is close to the posterior.

Another important result is that minimizing this KL divergence is equivalent to maximizing the log evidence lower bound,

$$L_{VI} := \int q(\theta) \log p(Y | X, \theta) \, d\theta - KL(q(\theta) \,\|\, p(\theta))$$

with respect to the variational parameters defining $q(\theta)$.
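The equivalence follows from a one-line identity (using the definitions above, with $p(\theta)$ the prior and $p(Y|X)$ the log evidence):

$$\log p(Y | X) = L_{VI} + KL(q(\theta) \,\|\, p(\theta | X, Y))$$

Since the left-hand side does not depend on $q$, maximizing $L_{VI}$ over the variational parameters necessarily minimizes the KL divergence to the posterior.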

### Algorithm

The authors proved that using dropout at test time performs approximate Bayesian inference and provides good uncertainty estimates. A Bernoulli distribution is chosen as the variational distribution $q(\theta)$.

At test time one should approximate the predictive distribution by Monte Carlo (MC) integration with $T$ samples:

$$p(y^* | x^*, X, Y) \approx \frac{1}{T} \sum_{i=1}^{T} p(y^* | x^*, \hat{\theta}_i)$$ with $\hat{\theta}_i \sim q(\theta)$.

Hence this is named MC dropout.
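A minimal sketch of MC dropout in PyTorch, assuming a small regression network; the names `MCDropoutNet` and `mc_predict` are illustrative, not from the paper. The only unusual step is leaving dropout active at prediction time, so each forward pass samples a different Bernoulli weight mask, i.e. a different $\hat{\theta}_i \sim q(\theta)$:

```python
import torch
import torch.nn as nn


class MCDropoutNet(nn.Module):
    """Tiny regression net with a dropout layer (illustrative)."""

    def __init__(self, d_in=1, d_hidden=32, p=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.ReLU(),
            nn.Dropout(p),  # the Bernoulli noise that q(theta) corresponds to
            nn.Linear(d_hidden, 1),
        )

    def forward(self, x):
        return self.net(x)


def mc_predict(model, x, T=100):
    """Run T stochastic forward passes with dropout ON and return the
    predictive mean and standard deviation across the T samples."""
    model.train()  # train mode keeps nn.Dropout sampling masks
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(T)])  # (T, N, 1)
    return samples.mean(dim=0), samples.std(dim=0)


x = torch.linspace(-1.0, 1.0, 5).unsqueeze(1)  # 5 test points
model = MCDropoutNet()
mean, std = mc_predict(model, x, T=50)
```

Here `std` is the per-point predictive uncertainty; averaging the $T$ stochastic outputs realizes the MC sum above.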