Y Gal, Z Ghahramani (2015)
Dropout as a Bayesian Approximation: Insights and Applications

The crux of Bayesian methods is treating the parameters of the model as distributions rather than point estimates. This point of view gives us an uncertainty measure for predictions.


Predictive posterior distribution: Given inputs $X$, targets $Y$, parameters $\theta$, and a new input point $x^{*}$, the predictive posterior distribution is

$$ p(y^* | x^* , X, Y) = \int p(y^* | x^*, \theta) \, p(\theta | X, Y) \, d \theta$$

The distribution $p(\theta | X, Y)$ is called the posterior distribution over the parameters, and it cannot be evaluated analytically for complex models such as neural networks. So a distribution $q(\theta)$ is chosen from a family of "nice" (tractable) distributions and the KL divergence $KL(q(\theta) \,\|\, p(\theta | X,Y))$ is minimized so that $q(\theta)$ is close to the posterior.

Another important result is that minimizing this KL divergence is equivalent to maximizing the log evidence lower bound,

$$L_{VI} := \int q(\theta) \, \log p(Y | X, \theta) \, d \theta - KL(q(\theta) \,\|\, p(\theta)) $$

with respect to the variational parameters defining $q(\theta)$.
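The identity behind this equivalence is that the ELBO plus $KL(q(\theta) \,\|\, p(\theta | X,Y))$ equals the (fixed) log evidence $\log p(Y|X)$. A minimal numpy sketch, using a conjugate Gaussian toy model (my own example, not from the paper) where every term is available in closed form, so the identity can be checked numerically:

```python
import numpy as np

# Toy conjugate model: prior theta ~ N(0, 1), likelihood y | theta ~ N(theta, 1),
# a single observation y. Posterior and evidence are analytic, so we can verify
# ELBO = log-evidence - KL(q || posterior).
y = 1.0
m, s = 0.3, 0.8  # variational q(theta) = N(m, s^2); arbitrary illustrative values

# ELBO = E_q[log p(y | theta)] - KL(q || prior), both in closed form here
expected_loglik = -0.5 * np.log(2 * np.pi) - 0.5 * ((y - m) ** 2 + s ** 2)
kl_q_prior = 0.5 * (s ** 2 + m ** 2 - 1.0 - np.log(s ** 2))
elbo = expected_loglik - kl_q_prior

# Analytic posterior N(y/2, 1/2) and log evidence log N(y; 0, 2)
mu_post, var_post = y / 2.0, 0.5
log_evidence = -0.5 * np.log(2 * np.pi * 2.0) - y ** 2 / 4.0
kl_q_post = (np.log(np.sqrt(var_post) / s)
             + (s ** 2 + (m - mu_post) ** 2) / (2 * var_post) - 0.5)

# The sum ELBO + KL(q || posterior) is the constant log evidence, so
# maximizing the ELBO over (m, s) minimizes KL(q || posterior).
assert np.isclose(elbo, log_evidence - kl_q_post)
```

Since $KL(q \,\|\, p(\theta|X,Y)) \ge 0$, the ELBO here is always at most the log evidence, hence the name "lower bound".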


The authors proved that using dropout at test time performs approximate Bayesian inference and provides good uncertainty estimates. The variational distribution $q(\theta)$ is built from Bernoulli random variables: each unit's weights are zeroed out with the dropout probability.

At test time one should use Monte Carlo (MC) integration with $T$ samples,

$$ p(y^* | x^* , X, Y) \approx \frac{1}{T} \sum ^{T} _{i = 1} p(y^* | x^*, \hat{\theta} _{i}) \quad \text{with} \quad \hat{\theta} _{i} \sim q(\theta). $$

Hence this is named MC dropout.
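In practice this means keeping dropout switched on at test time and averaging $T$ stochastic forward passes; the spread of the passes gives the uncertainty estimate. A minimal numpy sketch (layer sizes, weights, and names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hidden-layer network with fixed (pretend "trained") weights.
W1 = rng.normal(size=(1, 50)) * 0.5
b1 = np.zeros(50)
W2 = rng.normal(size=(50, 1)) * 0.5
b2 = np.zeros(1)
p_keep = 0.9  # keep probability (dropout probability is 1 - p_keep)

def stochastic_forward(x):
    """One forward pass with dropout ON: one sample theta_i ~ q(theta)."""
    h = np.maximum(0.0, x @ W1 + b1)                        # ReLU hidden layer
    mask = rng.binomial(1, p_keep, size=h.shape) / p_keep   # Bernoulli mask
    return (h * mask) @ W2 + b2

x_star = np.array([[0.5]])
T = 200
samples = np.stack([stochastic_forward(x_star) for _ in range(T)])

pred_mean = samples.mean(axis=0)  # MC estimate of the predictive mean
pred_std = samples.std(axis=0)    # spread across passes ~ predictive uncertainty
```

Each pass samples a fresh Bernoulli mask, i.e. a fresh $\hat{\theta}_i \sim q(\theta)$, so averaging the passes approximates the integral above.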


  • The paper is well written and provides an alternative explanation of dropout's robustness to overfitting.
  • Although the proofs in the paper make sense, the result is not fully convincing: in the infinite-data limit the posterior distribution should shrink to a point, thereby reducing variance, but Bernoulli distributions have a fixed variance (depending only on the probability chosen).