Clustering operates in a feature space, and its result depends heavily on how distances and dissimilarities between samples are measured there. Handcrafted features are not always suitable or feasible for clustering images and text, since there are many possible clustering objectives. Hence the authors propose a data-driven approach that jointly learns the feature space and the cluster memberships.
They parameterize a non-linear mapping $f_{\theta}$ from the data space $X$ to a lower-dimensional feature space $Z$ with a deep neural network, and optimize the clustering objective in $Z$ using SGD.

The clustering is done in two phases:

Phase-1: Parameter initialization with a deep denoising autoencoder
Phase-2: Parameter optimization (clustering)

Phase-1: Parameter Initialization:

A stacked autoencoder is trained greedily, layer by layer, to minimize reconstruction loss. Only the encoder part of the autoencoder is carried over to Phase-2. Initial cluster centroids are obtained by passing the data points through the encoder and then running k-means on the resulting features in $Z$.
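
The centroid initialization can be illustrated with a minimal sketch. It assumes the embedded points $z_i = f_{\theta}(x_i)$ have already been produced by the pretrained encoder (not shown here), and it uses a plain Lloyd's k-means written for illustration; the function name `kmeans` and its parameters are hypothetical, not taken from the authors' code.

```python
import random

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means on a list of equal-length numeric lists.

    `points` stands in for the encoder outputs z_i (an assumption of this sketch).
    """
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels
```

The returned centroids would then serve as the starting values that Phase-2 refines.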

Phase-2: Clustering with KL-divergence:

In this phase we jointly optimize the non-linear mapping $f_{\theta}$ and the cluster centroids, starting from their Phase-1 values. The phase consists of two steps, 1) soft assignment and 2) KL-divergence minimization, which are repeated until a convergence criterion is met.

Step 1: Soft Assignment:

Compute a soft assignment between the embedded points and the cluster centroids.

  • Student’s t-distribution is used as a kernel to measure the similarity between an embedded point and a centroid.
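
As a sketch of this step, the Student's t kernel is commonly written as $q_{ij} \propto (1 + \|z_i - \mu_j\|^2 / \alpha)^{-(\alpha + 1)/2}$, normalized over the centroids $\mu_j$ so that each row sums to one. The formula and the default $\alpha = 1$ are assumptions here, since the summary above does not state them, and the function name `soft_assign` is hypothetical.

```python
def soft_assign(z, centroids, alpha=1.0):
    """Return q[i][j], the soft assignment of embedded point i to centroid j.

    Uses a Student's t kernel with `alpha` degrees of freedom
    (alpha = 1 is an assumption of this sketch).
    """
    q = []
    for zi in z:
        # Unnormalized kernel: (1 + ||z_i - mu_j||^2 / alpha) ^ (-(alpha + 1) / 2)
        row = [
            (1.0 + sum((a - b) ** 2 for a, b in zip(zi, mu)) / alpha)
            ** (-(alpha + 1.0) / 2.0)
            for mu in centroids
        ]
        total = sum(row)
        # Normalize over centroids so each point's assignments sum to 1.
        q.append([v / total for v in row])
    return q
```

A point sitting exactly on one centroid still receives a non-zero probability for the others, because the kernel has heavy tails; this is what makes the assignment "soft".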


Step 2: KL Divergence Minimization:

Update the non-linear mapping $f_{\theta}$ and refine the cluster centroids by learning from the current high-confidence assignments using an auxiliary target distribution.

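The objective is defined as a KL divergence loss between the soft assignments $q_i$ and the auxiliary distribution $p_i$, where $q_{ij}$ and $p_{ij}$ denote the entries of $q_i$ and $p_i$ for cluster $j$:

$$L = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$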

The choice of the target distribution $P$ is crucial. $P$ should have the following properties:

  • Strengthen predictions (i.e., improve cluster purity)
  • Place more emphasis on data points assigned with high confidence
  • Normalize loss contribution of each centroid to prevent large clusters from distorting the feature space.
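
One target distribution with these properties squares the soft assignments and divides by the soft cluster frequency: $p_{ij} \propto q_{ij}^2 / f_j$ with $f_j = \sum_i q_{ij}$, normalized over $j$. This formula is an assumption here, since the summary above lists only the properties. The sketch below computes it together with the KL objective between $P$ and $Q$ described earlier; the names `target_distribution` and `kl_divergence` are hypothetical.

```python
import math

def target_distribution(q):
    """p[i][j] proportional to q[i][j]^2 / f_j, where f_j = sum_i q[i][j]."""
    n_clusters = len(q[0])
    # Soft cluster frequencies; dividing by them keeps large clusters
    # from dominating the loss.
    f = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        # Squaring emphasizes assignments that are already confident.
        weights = [row[j] ** 2 / f[j] for j in range(n_clusters)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p

def kl_divergence(p, q):
    """KL(P || Q) summed over all points and clusters."""
    return sum(
        pij * math.log(pij / qij)
        for p_row, q_row in zip(p, q)
        for pij, qij in zip(p_row, q_row)
        if pij > 0.0  # terms with p = 0 contribute nothing
    )
```

For example, with `q = [[0.8, 0.2], [0.3, 0.7]]` the dominant entry of each row grows in `p`, which illustrates the first two properties; in training, the gradient of the KL loss with respect to the embedded points and centroids would be what drives the update.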