Wallach, H. M., Murray, I., Salakhutdinov, R., & Mimno, D. (2009, June).
Evaluation methods for topic models. In Proceedings of the 26th annual
international conference on machine learning
(pp. 1105-1112). ACM.


This paper presents several approaches to evaluating a probabilistic topic
model for a corpus of text documents, with a focus on LDA. The main task
considered in the paper is estimating the probability of held-out documents
given a trained model. This quantity arises when computing the probability of
held-out documents $W$ given a set of training documents $W'$, $P(W|W')$. The
paper focuses on evaluating $P(W|\Phi,\alpha{\mathbf{m}})$, where $\Phi$
denotes the topics and $\alpha{\mathbf{m}}$ the Dirichlet prior over
document-topic proportions.
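
The relationship between the two quantities can be written out as follows. This is a sketch of the standard marginalisation in the notation used above, not a quotation of the paper's equations; the document index $d$ is introduced here for illustration.

$$
P(W \mid W') = \int P(W \mid \Phi, \alpha\mathbf{m})\, P(\Phi, \alpha\mathbf{m} \mid W')\, d\Phi\, d\alpha\, d\mathbf{m},
\qquad
P(W \mid \Phi, \alpha\mathbf{m}) = \prod_{d} P(\mathbf{w}^{(d)} \mid \Phi, \alpha\mathbf{m})
$$

Since held-out documents are independent given the model parameters, the problem reduces to estimating the per-document probability $P(\mathbf{w} \mid \Phi, \alpha\mathbf{m})$.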

The estimators compared are:

  1. Importance sampling, used to estimate $P(\mathbf{w}|\Phi,\alpha{\mathbf{m}})$.
  2. The harmonic mean method.
  3. Annealed importance sampling, which introduces auxiliary variables to
    bring the proposal distribution closer to the target distribution.
  4. Chib-style estimation.
  5. A "left-to-right" evaluation algorithm introduced by Wallach.
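
To make the first item concrete, here is a minimal sketch of importance sampling for $P(\mathbf{w}|\Phi,\alpha{\mathbf{m}})$ in which the Dirichlet prior over topic proportions serves as the proposal. It is an illustration under stated assumptions, not the paper's code: the function names, the toy `phi` and `alpha_m` values, and the brute-force check are all hypothetical, and the exact sum is only feasible for very short documents.

```python
import math
import random
from itertools import product


def sample_dirichlet(alpha_m, rng):
    """Draw theta ~ Dirichlet(alpha_m) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha_m]
    total = sum(draws)
    return [d / total for d in draws]


def log_prob_given_theta(doc, phi, theta):
    """log P(w | theta, Phi), with each token's topic summed out."""
    return sum(
        math.log(sum(t * topic[w] for t, topic in zip(theta, phi)))
        for w in doc
    )


def importance_sampling_log_prob(doc, phi, alpha_m, num_samples=20000, seed=0):
    """Estimate log P(w | Phi, alpha m) with the prior over theta as proposal."""
    rng = random.Random(seed)
    logs = [
        log_prob_given_theta(doc, phi, sample_dirichlet(alpha_m, rng))
        for _ in range(num_samples)
    ]
    # Average the sampled probabilities in log space for numerical stability.
    peak = max(logs)
    return peak + math.log(sum(math.exp(v - peak) for v in logs) / num_samples)


def exact_log_prob(doc, phi, alpha_m):
    """Brute-force log P(w | Phi, alpha m): sum over every topic assignment."""
    num_topics = len(alpha_m)
    alpha = sum(alpha_m)
    total = 0.0
    for z in product(range(num_topics), repeat=len(doc)):
        counts = [0] * num_topics
        p = 1.0
        for n, (w, k) in enumerate(zip(doc, z)):
            # Polya urn: P(z_n = k | earlier z) = (count_k + alpha m_k) / (n + alpha)
            p *= (counts[k] + alpha_m[k]) / (n + alpha)
            p *= phi[k][w]
            counts[k] += 1
        total += p
    return math.log(total)


# Toy model (assumed values): 2 topics over a 3-word vocabulary.
toy_phi = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
toy_alpha_m = [0.5, 0.5]
toy_doc = [0, 2, 2, 1]
print(importance_sampling_log_prob(toy_doc, toy_phi, toy_alpha_m))
print(exact_log_prob(toy_doc, toy_phi, toy_alpha_m))
```

On a document this short the two numbers should agree closely; the prior is a poor proposal for longer documents, which is one reason more elaborate estimators are of interest.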

Document completion is also proposed as a way to evaluate models: the second
half of each held-out document is withheld, and the model is scored by the
probability of that second half given the first half. This may not be
applicable when documents are short, which in turn may call for techniques
other than LDA (e.g., biterm topic models).
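
As a rough illustration of document completion, the conditional probability of the second half can be written as a ratio of two marginals, P(first half and second half) / P(first half). The sketch below computes that ratio exactly by brute force on a toy model. The names `marginal_prob` and `completion_log_prob` and all numeric values are assumptions made for this example; a practical evaluation would replace the brute-force sum with one of the estimators listed earlier.

```python
import math
from itertools import product


def marginal_prob(doc, phi, alpha_m):
    """Exact P(w | Phi, alpha m), summing over all topic assignments (tiny docs only)."""
    num_topics = len(alpha_m)
    alpha = sum(alpha_m)
    total = 0.0
    for z in product(range(num_topics), repeat=len(doc)):
        counts = [0] * num_topics
        p = 1.0
        for n, (w, k) in enumerate(zip(doc, z)):
            # Prior probability of topic k given earlier assignments, times word probability.
            p *= (counts[k] + alpha_m[k]) / (n + alpha) * phi[k][w]
            counts[k] += 1
        total += p
    return total


def completion_log_prob(doc, phi, alpha_m):
    """log P(second half | first half, Phi, alpha m) as a ratio of marginals."""
    half = len(doc) // 2
    first = doc[:half]
    joint = marginal_prob(doc, phi, alpha_m)
    return math.log(joint) - math.log(marginal_prob(first, phi, alpha_m))


# Toy model (assumed values): 2 topics over a 3-word vocabulary.
toy_phi = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
toy_alpha_m = [0.5, 0.5]
print(completion_log_prob([0, 2, 2, 1], toy_phi, toy_alpha_m))
```

A useful sanity check on any such implementation is that the conditional probabilities of all possible second halves of a fixed length sum to one.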

The authors note in their experiments that the Chib-style estimator and the
"left-to-right" algorithm both performed well, whereas importance sampling
and the harmonic mean method over- or underestimated the probability. They
also note that while the earlier methods, including the harmonic mean method
and importance sampling, may produce valid rankings of models, they may
misrepresent the relative advantage of one model over another.


  • So far, we are not aware of any implementations of these methods in popular
    machine learning packages.

  • Chang et al. note that, in their study, models that do better on held-out
    likelihood (the quantity discussed in the above paper) may infer less
    semantically meaningful topics.


  title={Evaluation methods for topic models},
  author={Wallach, Hanna M and Murray, Iain and Salakhutdinov, Ruslan and Mimno, David},
  booktitle={Proceedings of the 26th annual international conference on machine learning},