Evaluation Methods for Topic Models

Wallach, H. M., Murray, I., Salakhutdinov, R., & Mimno, D. (2009, June). Evaluation methods for topic models. In Proceedings of the 26th annual international conference on machine learning (pp. 1105-1112). ACM.


This paper presents approaches to validating a probabilistic topic model for a corpus of text documents, with a focus on LDA. The main task considered is estimating the probability of held-out documents given a trained model. This task arises when computing the probability of held-out documents $W$ given a set of training documents $W'$, that is, $P(W|W')$. The paper focuses on evaluating $P(W|\Phi,\alpha\mathbf{m})$, where $\Phi$ denotes the topic-word distributions and $\alpha\mathbf{m}$ the Dirichlet prior over per-document topic proportions, and compares the following estimators:

  1. Importance sampling to estimate $P(\mathbf{w}|\Phi,\alpha\mathbf{m})$ for a single document $\mathbf{w}$.
  2. The harmonic mean method.
  3. Annealed importance sampling, which introduces auxiliary variables to make the proposal distribution closer to the target distribution.
  4. Chib-style estimation.
  5. A “left-to-right” evaluation algorithm introduced by Wallach.
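To make the first item concrete, here is a minimal sketch of an importance sampling estimator. It assumes the simplest possible proposal, the Dirichlet prior itself, so that $P(\mathbf{w}|\Phi,\alpha\mathbf{m}) \approx \frac{1}{S}\sum_s \prod_n \sum_k \theta^{(s)}_k \phi_{k,w_n}$ with $\theta^{(s)} \sim \mathrm{Dir}(\alpha\mathbf{m})$. The function name and arguments are our own, and this is not necessarily the proposal used in the paper's experiments.

```python
import numpy as np

def log_prob_importance(doc, phi, alpha_m, num_samples=1000, rng=None):
    """Estimate log P(w | Phi, alpha*m) for one document (hypothetical helper).

    doc:     1-D integer array of word ids
    phi:     (K, V) array, row k is the word distribution of topic k
    alpha_m: length-K Dirichlet parameter vector (alpha times m)
    """
    rng = np.random.default_rng(rng)
    # Proposal = prior, so each importance weight is just the likelihood
    # P(w | theta, Phi) of the document under the sampled proportions.
    theta = rng.dirichlet(alpha_m, size=num_samples)   # shape (S, K)
    # Topic assignments are summed out per token:
    # P(w_n | theta, Phi) = sum_k theta_k * phi[k, w_n]
    token_probs = theta @ phi[:, doc]                  # shape (S, N)
    log_lik = np.log(token_probs).sum(axis=1)          # shape (S,)
    # Average the likelihoods in log space for numerical stability.
    return np.logaddexp.reduce(log_lik) - np.log(num_samples)
```

Sampling from the prior ignores the document entirely, so the estimate can have very high variance for long documents. That weakness is the motivation for the more elaborate estimators in the list.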

Document completion is also proposed as a way to evaluate models: hold out the second half of a document and estimate its probability given the first half and the trained model. This may not be applicable when documents are short, which may in turn call for techniques other than LDA (e.g., biterm topic models).
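As a rough illustration of document completion, the sketch below estimates $P(\mathbf{w}^{(2)}|\mathbf{w}^{(1)},\Phi,\alpha\mathbf{m})$ as the ratio $P(\mathbf{w}^{(1)},\mathbf{w}^{(2)}) / P(\mathbf{w}^{(1)})$, with both terms estimated from the same samples of $\theta$ drawn from the prior. This is a simplification of our own for illustration, not the estimator from the paper, and the function name is hypothetical.

```python
import numpy as np

def log_prob_completion(first_half, second_half, phi, alpha_m,
                        num_samples=1000, rng=None):
    """Estimate log P(second_half | first_half, Phi, alpha*m) (hypothetical helper)."""
    rng = np.random.default_rng(rng)
    theta = rng.dirichlet(alpha_m, size=num_samples)            # shape (S, K)
    # Given theta, the two halves are independent, so their
    # log-likelihoods simply add.
    log_w1 = np.log(theta @ phi[:, first_half]).sum(axis=1)     # shape (S,)
    log_w2 = np.log(theta @ phi[:, second_half]).sum(axis=1)    # shape (S,)
    # Self-normalized estimate: sum_s P(w1, w2 | theta_s) / sum_s P(w1 | theta_s)
    return np.logaddexp.reduce(log_w1 + log_w2) - np.logaddexp.reduce(log_w1)

# Usage: split a document of word ids down the middle.
phi = np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])
doc = np.array([0, 0, 1, 2, 2, 0])
half = len(doc) // 2
score = log_prob_completion(doc[:half], doc[half:], phi, np.array([0.5, 0.5]), rng=0)
```

The first half acts as evidence about the document's topic proportions, which is why a very short document leaves little to condition on.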

The authors note in their experiments that the Chib-style estimator and the “left-to-right” algorithm both performed well, whereas the others, importance sampling and the harmonic mean method, over- or underestimated the probability. They also note that while the earlier methods, including the harmonic mean method and importance sampling, may result in valid rankings of models, they may misrepresent the relative advantage of one model over another.
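To see over- or underestimation for yourself on a toy scale, it helps to have the exact value to compare against. For a very short document, $P(\mathbf{w}|\Phi,\alpha\mathbf{m}) = \sum_{\mathbf{z}} P(\mathbf{w}|\mathbf{z},\Phi)\,P(\mathbf{z}|\alpha\mathbf{m})$ can be computed by brute-force enumeration of all $K^N$ topic assignments. The sketch below is our own sanity-check helper, not something from the paper, and it is only feasible for tiny $K$ and $N$.

```python
import itertools
import math
import numpy as np

def log_prob_exact(doc, phi, alpha_m):
    """Exact log P(w | Phi, alpha*m) by enumerating all K**N assignments (hypothetical helper)."""
    num_topics = phi.shape[0]
    num_tokens = len(doc)
    alpha = float(np.sum(alpha_m))
    terms = []
    for z in itertools.product(range(num_topics), repeat=num_tokens):
        counts = np.bincount(z, minlength=num_topics)
        # log P(z | alpha*m): Dirichlet-multinomial probability of this
        # particular sequence of topic assignments.
        log_pz = math.lgamma(alpha) - math.lgamma(num_tokens + alpha)
        log_pz += sum(math.lgamma(c + a) - math.lgamma(a)
                      for c, a in zip(counts, alpha_m))
        # log P(w | z, Phi): each token is drawn from its assigned topic.
        log_pw = sum(math.log(phi[k, w]) for k, w in zip(z, doc))
        terms.append(log_pz + log_pw)
    return float(np.logaddexp.reduce(np.array(terms)))
```

Any sampling-based estimate for the same document and parameters can then be compared against this value to check the size and direction of its error.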


  • So far, we are not aware of any implementations of these methods in popular machine learning packages.

  • Chang et al. found in their study that models which do better on held-out likelihood (discussed in the above paper) may infer less semantically meaningful topics.


About Srikumar Subramanian, "Sri"