Uncertainty is introduced in machine learning models to give us a measure of how confident a model is about its predictions. A brief survey of recent advances in this topic is available in our previous blog post. In this post, we list a few guidelines on how to incorporate uncertainty into our existing projects.

For our projects involving entity extraction from documents, the techniques in use are based on sequence tagging. Depending on whether we use a single model or an ensemble of models, the following uncertainty measures are available.

### Single model

When using a single model, some basic measures of uncertainty are:

• Least confidence: if the predicted probability of the most likely label sequence is not high, the model is inferred to be uncertain. Specifically, the uncertainty value $\phi(\mathbf{x})$ for a given sample $\mathbf{x}$ is defined as $\phi(\mathbf{x})=1-P\left(\mathbf{y}^{*} \mid \mathbf{x} ; \theta\right)$, where $\mathbf{y}^{*}$ is the most likely label sequence.
• Margin: the difference between the probabilities of the best and the second-best label sequences, $\phi(\mathbf{x})=-\left( P\left(\mathbf{y}_{1}^{*} \mid \mathbf{x} ; \theta \right) - P\left(\mathbf{y}_{2}^{*} \mid \mathbf{x} ; \theta \right) \right)$, where $\mathbf{y}_{1}^{*}$ and $\mathbf{y}_{2}^{*}$ are the best and second-best label sequences. A small margin gives a high uncertainty value.
• Token entropy: the entropy of the predicted output sequence. Specifically, the average entropy of each label output in a given sequence is widely used as a measure of uncertainty: $\phi^{T E}(\mathbf{x})=-\frac{1}{T} \sum_{t=1}^{T} \sum_{m=1}^{M} P_{\theta}\left(y_{t}=m\right) \log P_{\theta}\left(y_{t}=m\right)$, where $T$ is the length of the sequence and $M$ is the number of labels. For CRFs and HMMs, the marginals required to compute the entropy can be obtained using the forward-backward algorithm.
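
The three measures above can be sketched in a few lines of NumPy. This is a minimal illustration rather than our production code: the function names are hypothetical, and it assumes the sequence probabilities and the per-token marginals (an array of shape `(T, M)`) have already been obtained from the tagger.

```python
import numpy as np

def least_confidence(best_seq_prob):
    """phi(x) = 1 - P(y* | x): high when the best sequence is improbable."""
    return 1.0 - best_seq_prob

def negative_margin(best_seq_prob, second_seq_prob):
    """phi(x) = -(P(y1* | x) - P(y2* | x)): close to 0 when the top two sequences tie."""
    return -(best_seq_prob - second_seq_prob)

def token_entropy(marginals, eps=1e-12):
    """Average per-token entropy for marginals of shape (T, M), rows summing to 1."""
    marginals = np.asarray(marginals, dtype=float)
    # Clip inside the log so that zero probabilities contribute 0 instead of NaN.
    per_token = -np.sum(marginals * np.log(np.clip(marginals, eps, 1.0)), axis=1)
    return float(per_token.mean())

# Hypothetical marginals for a 2-token sequence with 3 labels.
example = [[0.90, 0.05, 0.05],
           [0.40, 0.30, 0.30]]
print(token_entropy(example))
```

All three functions return larger values for more uncertain samples, so they can be used interchangeably wherever a single uncertainty score per sample is needed.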

### Ensemble of models

When using an ensemble of models, some basic measures of uncertainty are:

• Vote-based entropy: for each token, the fraction of models that vote for each label is computed. The average entropy of this empirical distribution is then used as a measure of uncertainty: $\phi^{V E}(\mathbf{x})=-\frac{1}{T} \sum_{t=1}^{T} \sum_{m=1}^{M} \frac{V\left(y_{t}, m\right)}{C} \log \frac{V\left(y_{t}, m\right)}{C}$, where $V\left(y_{t}, m\right)$ is the number of votes label $m$ receives from the $C$ models considered.
• KL divergence based uncertainty: this measures how far each individual model's prediction diverges from the average prediction of the ensemble, for each label in the sequence: $\phi^{K L}(\mathbf{x})=\frac{1}{T} \sum_{t=1}^{T} \frac{1}{C} \sum_{c=1}^{C} D\left(p^{(c)} \,\|\, p^{ensemble}\right)$, where $D\left(p^{(c)} \,\|\, p^{ensemble}\right)$ is the KL divergence between the prediction of a particular model $c$ and the overall ensemble prediction.
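
As a rough sketch of the two ensemble measures, the NumPy snippet below computes both for one sequence. The function names are hypothetical. It assumes `votes` holds the hard label chosen by each of the $C$ models for each token (shape `(C, T)`), `probs` holds each model's per-token label distribution (shape `(C, T, M)`), and the ensemble prediction is the plain average of the individual distributions.

```python
import numpy as np

def vote_entropy(votes, num_labels, eps=1e-12):
    """Average per-token entropy of the empirical vote distribution."""
    votes = np.asarray(votes)
    num_models = votes.shape[0]
    # counts[t, m] = V(y_t, m): number of models voting label m at token t.
    counts = np.stack([(votes == m).sum(axis=0) for m in range(num_labels)], axis=1)
    frac = counts / num_models
    per_token = -np.sum(frac * np.log(np.clip(frac, eps, 1.0)), axis=1)
    return float(per_token.mean())

def kl_disagreement(probs, eps=1e-12):
    """Mean over tokens and models of KL(p^(c) || p^ensemble)."""
    probs = np.asarray(probs, dtype=float)
    # Assumption: the ensemble prediction is the average of the model predictions.
    ensemble = probs.mean(axis=0, keepdims=True)
    log_ratio = np.log(np.clip(probs, eps, 1.0)) - np.log(np.clip(ensemble, eps, 1.0))
    kl = np.sum(probs * log_ratio, axis=2)  # shape (C, T)
    return float(kl.mean())

# Three hypothetical models tagging a 2-token sequence with 3 labels.
votes = [[0, 1],
         [0, 2],
         [0, 1]]
print(vote_entropy(votes, num_labels=3))
```

Vote entropy only needs hard predictions, while the KL-based measure needs each model's full label distribution, which makes it sensitive to disagreement in confidence as well as in the chosen label.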

The ensemble of models can either be obtained by training multiple models independently or by obtaining different realisations of a trained Bayesian model such as a Bayesian CRF or a Bayesian RNN.

## Useful applications of Uncertainty estimates

### Active Learning

The data requirements of deep learning algorithms are enormous. Labelled data is rarely available in such large quantities, and data labelling is a long, laborious and expensive process. Active learning is a framework in which the system learns from relatively small amounts of data and chooses for itself which data it would like the human expert to label.

In active learning, a model is trained on a small labelled data set (the initial training set), and an acquisition function (often based on the model's uncertainty estimates) decides which data points to ask the user to label. The acquisition function selects one or more points from a pool of unlabelled data points, with the pool points lying outside of the training set. A human expert labels the selected data points, which are then added to the training set, and a new model is trained on the updated training set. This process is repeated, with the training set increasing in size over time. The advantage of such systems is that, by using the labelling budget efficiently, they often dramatically reduce the amount of labelling (and therefore the cost and time) required to train an ML system.
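
The loop described above can be sketched as follows. This is a schematic only: `train_fn`, `acquisition_fn` and `oracle_fn` are hypothetical hooks standing in for model training, uncertainty scoring and the human expert, and the fixed number of rounds and batch size are assumptions made for illustration.

```python
import numpy as np

def active_learning_loop(X_train, y_train, X_pool, train_fn, acquisition_fn,
                         oracle_fn, rounds=5, batch_size=2):
    """Pool-based active learning.

    train_fn(X, y) -> model
    acquisition_fn(model, X_pool) -> one score per pool point (higher = label first)
    oracle_fn(X) -> labels supplied by the human expert
    """
    for _ in range(rounds):
        if len(X_pool) == 0:
            break
        model = train_fn(X_train, y_train)
        scores = np.asarray(acquisition_fn(model, X_pool))
        # Pick the pool points with the highest acquisition scores.
        picked = np.argsort(-scores)[:batch_size]
        new_labels = oracle_fn(X_pool[picked])
        # Move the newly labelled points from the pool to the training set.
        X_train = np.concatenate([X_train, X_pool[picked]])
        y_train = np.concatenate([y_train, new_labels])
        X_pool = np.delete(X_pool, picked, axis=0)
    # Train a final model on everything labelled so far.
    return train_fn(X_train, y_train), X_train, y_train, X_pool
```

Retraining from scratch in every round keeps the sketch simple. In practice the stopping rule is usually a labelling budget or a validation metric rather than a fixed number of rounds.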

#### Acquisition Functions

Given a model $\mathcal{M}$, pool data $D_{pool}$ and inputs $x \in D_{pool}$, an acquisition function $a(x, \mathcal{M})$ is a function of $x$ that the active learning system uses to decide where to query next:

$$x^{*} = \underset{x \in D_{pool}}{\operatorname{argmax}} \; a(x, \mathcal{M})$$

There are many acquisition functions, but only a few are applicable to large-scale data such as images. One such acquisition function is BALD (Bayesian Active Learning by Disagreement), which has been applied to the task of image recognition.

BALD uses an acquisition function that estimates the mutual information between the model predictions and the model parameters. Intuitively, it captures how strongly the model predictions for a given data point and the model parameters are coupled, implying that finding out about the true label of data points with high mutual information would also inform us about the true model parameters. BALD is defined as,

$$\mathbb{I}\left(y ; \boldsymbol{\omega} \mid \boldsymbol{x}, \mathcal{D}_{\text{train}}\right) = \mathbb{H}\left(y \mid \boldsymbol{x}, \mathcal{D}_{\text{train}}\right) - \mathbb{E}_{\mathrm{p}\left(\boldsymbol{\omega} \mid \mathcal{D}_{\text{train}}\right)}\left[\mathbb{H}\left(y \mid \boldsymbol{x}, \boldsymbol{\omega}, \mathcal{D}_{\text{train}}\right)\right]$$

Looking at the right-hand side of the equation, the first term is the entropy of the model prediction averaged over the posterior, and the second term is the average entropy of the predictions made under individual parameter samples from the posterior. For the mutual information on the left-hand side to be high, the first term has to be high and the second term has to be low. To see when this happens, sample 10 models from the posterior and get a prediction from each of them. The second term is low when every sampled model gives a confident prediction. The first term, the entropy of the average of these predictions, is high when those confident predictions disagree with one another. BALD is therefore highest for data points on which the sampled models are individually confident but mutually inconsistent.
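
A small worked example makes the two terms concrete. The numbers are chosen purely for illustration: two sampled models with predictions $p_1$ and $p_2$, two classes, natural logarithms, and $\bar{p}$ denoting the average prediction. When the two models are confident but disagree, the mutual information is at its maximum:

$$p_1 = (1, 0),\quad p_2 = (0, 1),\quad \bar{p} = (0.5, 0.5) \;\Rightarrow\; \mathbb{H}[\bar{p}] = \log 2 \approx 0.693,\quad \tfrac{1}{2}\left(\mathbb{H}[p_1] + \mathbb{H}[p_2]\right) = 0,\quad \mathbb{I} \approx 0.693$$

When both models are equally unsure, the average prediction has the same entropy, but so does each individual prediction, and the mutual information vanishes:

$$p_1 = p_2 = (0.5, 0.5),\quad \bar{p} = (0.5, 0.5) \;\Rightarrow\; \mathbb{H}[\bar{p}] = \log 2,\quad \tfrac{1}{2}\left(\mathbb{H}[p_1] + \mathbb{H}[p_2]\right) = \log 2,\quad \mathbb{I} = 0$$

Plain predictive entropy cannot tell these two cases apart, whereas BALD scores only the first one highly.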

We can approximate the posterior distribution $p(\omega \mid \mathcal{D}_{\text{train}})$ with $q_{\theta}(\omega)$, as described in blog post-1 and blog post-2. We can now write the acquisition function as follows:

$$\begin{aligned} \mathbb{I}\left[y, \boldsymbol{\omega} \mid \mathbf{x}, \mathcal{D}_{\text{train}}\right] &:= \mathbb{H}\left[y \mid \mathbf{x}, \mathcal{D}_{\text{train}}\right] - \mathbb{E}_{p\left(\boldsymbol{\omega} \mid \mathcal{D}_{\text{train}}\right)}\left[\mathbb{H}\left[y \mid \mathbf{x}, \boldsymbol{\omega}\right]\right] \\ &= -\sum_{c} p\left(y=c \mid \mathbf{x}, \mathcal{D}_{\text{train}}\right) \log p\left(y=c \mid \mathbf{x}, \mathcal{D}_{\text{train}}\right) + \mathbb{E}_{p\left(\boldsymbol{\omega} \mid \mathcal{D}_{\text{train}}\right)}\left[\sum_{c} p(y=c \mid \mathbf{x}, \boldsymbol{\omega}) \log p(y=c \mid \mathbf{x}, \boldsymbol{\omega})\right] \end{aligned}$$

Replacing $p(\omega \mid \mathcal{D}_{\text{train}})$ with $q_{\theta}(\omega)$ in the above equation gives the following approximation, in which the integrals and the expectation can be estimated with Monte Carlo sampling:

$$\mathbb{I}\left[y, \boldsymbol{\omega} \mid \mathbf{x}, \mathcal{D}_{\text{train}}\right] \approx -\sum_{c} \left(\int p(y=c \mid \mathbf{x}, \boldsymbol{\omega})\, q_{\theta}(\boldsymbol{\omega})\, \mathrm{d}\boldsymbol{\omega}\right) \cdot \log \left(\int p(y=c \mid \mathbf{x}, \boldsymbol{\omega})\, q_{\theta}(\boldsymbol{\omega})\, \mathrm{d}\boldsymbol{\omega}\right) + \mathbb{E}_{q_{\theta}(\boldsymbol{\omega})}\left[\sum_{c} p(y=c \mid \mathbf{x}, \boldsymbol{\omega}) \log p(y=c \mid \mathbf{x}, \boldsymbol{\omega})\right]$$

Consider the case of image classification in an active learning framework. When a new batch of images arrives, we label those images that maximize the approximate acquisition function.
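
A minimal sketch of this selection step is shown below. It assumes we already have class probabilities from $S$ stochastic forward passes, that is, $S$ parameter samples drawn from the approximate posterior, stored in an array `probs` of shape `(S, N, C)` for $N$ pool images and $C$ classes. The function names are hypothetical. The integrals are replaced by averages over the $S$ samples.

```python
import numpy as np

def bald_scores(probs, eps=1e-12):
    """Monte Carlo estimate of the BALD acquisition function for each pool point."""
    probs = np.asarray(probs, dtype=float)
    # Average over samples approximates the integral of p(y=c | x, w) q(w) dw.
    mean_probs = probs.mean(axis=0)  # shape (N, C)
    predictive_entropy = -np.sum(
        mean_probs * np.log(np.clip(mean_probs, eps, 1.0)), axis=1)
    # Entropy of each sampled model's prediction, averaged over the samples.
    expected_entropy = -np.sum(
        probs * np.log(np.clip(probs, eps, 1.0)), axis=2).mean(axis=0)
    return predictive_entropy - expected_entropy

def select_batch(probs, k):
    """Indices of the k pool points with the highest BALD score."""
    return np.argsort(-bald_scores(probs))[:k]

# Two hypothetical samples, three pool images, two classes:
# image 0: samples agree confidently, image 1: samples disagree confidently,
# image 2: both samples are unsure.
probs = np.array([
    [[1.0, 0.0], [1.0, 0.0], [0.5, 0.5]],
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
])
print(select_batch(probs, k=1))
```

In this toy input only image 1 receives a non-zero score, since it is the only one on which the sampled models are confident yet disagree.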