Jing

Reputation: 945

How do I cluster with KL-divergence?

I want to cluster my data with KL-divergence as my metric.

In K-means:

  1. Choose the number of clusters.

  2. Initialize each cluster's mean at random.

  3. Assign each data point to the cluster c whose mean is closest (minimal distance).

  4. Update each cluster's mean to that of the data points assigned to it.

In the Euclidean case the mean update is easy: just average the vectors assigned to each cluster.
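For reference, this is the update I mean, as a minimal sketch (the function name is illustrative, and it assumes no cluster ends up empty):

    import numpy as np

    def update_means(X, labels, k):
        # New mean of each cluster = average of the points assigned to it
        # (assumes every cluster has at least one assigned point)
        return np.vstack([X[labels == c].mean(axis=0) for c in range(k)])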

However, if I'd like to use KL-divergence as my metric, how do I update my mean?

Upvotes: 7

Views: 7931

Answers (4)

Ben Allison

Reputation: 7394

K-means is intended to work with Euclidean distance: if you want to use non-Euclidean similarities in clustering, you should use a different method. The most principled way to cluster with an arbitrary similarity metric is spectral clustering, and K-means can be derived as a variant of this where the similarities are the Euclidean distances.

And as @mitchus says, KL divergence is not a metric. You probably want the Jensen-Shannon divergence, or its square root, the Jensen-Shannon distance, which is symmetric (and the square root also satisfies the triangle inequality, so it is a proper metric).
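As an illustration of that route, here is a minimal sketch, assuming each row of X is a discrete probability distribution; the Gaussian-style kernel used to turn distances into affinities (and its bandwidth) is an arbitrary choice for this sketch, not part of any standard recipe:

    import numpy as np
    from scipy.spatial.distance import jensenshannon
    from sklearn.cluster import SpectralClustering

    # Toy data: 100 probability vectors of length 10
    X = np.random.dirichlet(alpha=np.ones(10), size=100)

    # Pairwise Jensen-Shannon distances (symmetric, unlike KL)
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = jensenshannon(X[i], X[j])

    # Turn distances into similarities; the RBF-style kernel and its
    # bandwidth are arbitrary choices here
    A = np.exp(-D**2 / (2 * D.std()**2))

    labels = SpectralClustering(n_clusters=3, affinity='precomputed',
                                random_state=0).fit_predict(A)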

Upvotes: 1

Fernando Ferreira

Reputation: 808

Well, it might not be a good idea to use KL in the "k-means framework". As was said, it is not symmetric, and k-means is intended to work in Euclidean space.

However, you can try using NMF (non-negative matrix factorization). In fact, in the book Data Clustering (edited by Aggarwal and Reddy) you can find a proof that NMF used as a clustering method works like k-means, only with a non-negativity constraint. The nice part is that NMF can minimize a whole family of distances and divergences. If you program in Python: scikit-learn 0.19 implements the beta divergence, where beta is a free parameter. Depending on the value of beta, the divergence behaves differently; at beta equal to 1 it reduces to the (generalized) KL divergence, and at beta equal to 2 to the Euclidean (Frobenius) loss.

This is widely used in the topic-modeling context, where people cluster documents/words over topics (or themes). With the KL loss, the factorization can be interpreted probabilistically, in terms of how the document-topic and topic-word distributions are related.
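A minimal sketch of that approach with scikit-learn, assuming X is a non-negative matrix such as document-term counts; the argmax rule for turning the factor matrix into cluster labels is a common but informal convention:

    import numpy as np
    from sklearn.decomposition import NMF

    # Toy non-negative data standing in for a document-term count matrix
    X = np.random.poisson(lam=1.0, size=(100, 50)).astype(float)

    # solver='mu' is required for non-Frobenius beta losses;
    # beta_loss='kullback-leibler' corresponds to beta = 1
    model = NMF(n_components=5, beta_loss='kullback-leibler', solver='mu',
                init='nndsvda', max_iter=500, random_state=0)
    W = model.fit_transform(X)   # (n_samples, n_components)
    H = model.components_        # (n_components, n_features)

    # Treat each component as a cluster: assign each sample to the
    # component with the largest weight
    labels = W.argmax(axis=1)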

You can find more information in:

  • FÉVOTTE, C., IDIER, J. "Algorithms for Nonnegative Matrix Factorization with the β-Divergence", Neural Computation, v. 23, n. 9, pp. 2421–2456, 2011. ISSN: 0899-7667. doi: 10.1162/NECO_a_00168.

  • LUO, M., NIE, F., CHANG, X., et al. “Probabilistic Non-Negative Matrix Factorization and Its Robust Extensions for Topic Modeling.” In: AAAI, pp. 2308–2314, 2017.

  • KUANG, D., CHOO, J., PARK, H. "Nonnegative matrix factorization for interactive topic modeling and document clustering". In: Partitional Clustering Algorithms, Springer, pp. 215–243, 2015.

http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html

Upvotes: 1

Bashar Haddad

Reputation: 368

It is not a good idea to use KLD, for two reasons:

  1. It is not symmetric: KLD(x,y) ≠ KLD(y,x) in general.
  2. You need to be careful when using KLD in code: division by zero (and the log of zero) can produce Inf and NaN values.

Adding a small smoothing constant avoids this, but may affect the accuracy.
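To make the second point concrete, a minimal sketch of a smoothed KLD (the epsilon value is an arbitrary choice):

    import numpy as np

    def kld(p, q, eps=1e-12):
        # KL(p || q) for discrete distributions, with smoothing to avoid
        # log(0) / division by zero. Note kld(p, q) != kld(q, p) in general.
        p = np.asarray(p, dtype=float) + eps
        q = np.asarray(q, dtype=float) + eps
        p /= p.sum()
        q /= q.sum()
        return float(np.sum(p * np.log(p / q)))

    p = [0.5, 0.5, 0.0]
    q = [0.9, 0.0, 0.1]
    print(kld(p, q), kld(q, p))   # asymmetric; without eps this would be inf/nan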

Upvotes: 3

mitchus

Reputation: 4877

Clustering with KL-divergence may not be the best idea, because KLD is missing an important property of metrics: symmetry. The resulting clusters could then be quite hard to interpret. If you want to go ahead with KLD anyway, you could use the average of the two KLDs as your distance, i.e.

d(x,y) = KLD(x,y)/2 + KLD(y,x)/2
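A minimal sketch of that symmetrised distance, assuming x and y are discrete probability vectors (scipy's entropy(p, q) computes KL(p || q)):

    from scipy.stats import entropy  # entropy(p, q) = KL(p || q)

    def sym_kld(x, y):
        # Symmetrised KL: the average of the two one-way divergences
        return 0.5 * entropy(x, y) + 0.5 * entropy(y, x)

    print(sym_kld([0.6, 0.3, 0.1], [0.1, 0.3, 0.6]))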

Upvotes: 7
