ndemir

Reputation: 1911

How to find characteristics of the clusters with mahout?

I am using Mahout 0.8, and after clustering my data I use this command to see the results:

mahout clusterdump --seqFileDir clusters/clusters-77/ --pointsDir clusters/clusteredPoints/

I also want to learn why rows end up in the same cluster. I think I could write code to find which features/dimensions are similar within a cluster.

Without writing code, can I find out why rows are clustered together?

In a nutshell: I want to learn the characteristics of the clusters.

Upvotes: 1

Views: 251

Answers (1)

Has QUIT--Anony-Mousse

Reputation: 77454

Many clustering algorithms will not provide an explanation. And even if they did, the answer would probably be little more than "because cluster center X is the closest". In particular, k-means is a numerical optimization method that can be written as searching for a (local) minimum of a particular objective function. So in essence, the reply is: because this cluster assignment minimizes that function.
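To make the "closest center" point concrete, here is a minimal sketch of the quantity k-means tries to minimize (the sum of squared distances from each point to its assigned center). The function names and the toy data are hypothetical, chosen only for illustration; this is not Mahout code.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)),
               key=lambda j: squared_distance(point, centers[j]))


def kmeans_objective(points, centers, assignment):
    """Sum of squared distances from each point to its assigned center."""
    return sum(squared_distance(p, centers[j])
               for p, j in zip(points, assignment))


# Toy data (made up for this example).
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centers = [(1.25, 1.5), (8.5, 8.75)]

# The only "explanation" k-means offers: each point goes to its nearest center.
assignment = [nearest_center(p, centers) for p in points]
best = kmeans_objective(points, centers, assignment)

# Moving any point to the other center can only increase the objective.
worse = kmeans_objective(points, centers, [1, 0, 1, 1])
```

For fixed centers, the nearest-center assignment gives the lowest value of this sum, which is all the algorithm can say about why a row landed in a given cluster.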

To some extent, this is inherent to the problem: clustering is an unsupervised technique, usually based on concepts such as minimizing an objective function or computing a graph subset (e.g. in density-based clustering, DBSCAN can be seen as finding density-connected subgraphs).

Now when going into "big data", explanations are of little interest. If you have just a few dozen points, explanations are good. If you have billions, who is going to look at the explanations (if they existed) anyway? In systems such as Mahout, often not even the exact solution is computed, but an approximation. If you need to be as fast as possible and are willing to sacrifice precision, then you are probably also willing to do without explanations.

If you want to learn more about the clusters, you can either

  • inspect them post-clustering with your own methods
  • use a smaller data size and a more complex algorithm that provides explanations
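As a rough sketch of the first option, one simple post-clustering inspection is to compare each cluster's per-feature mean with the global mean, scaled by the global standard deviation, and list the features that deviate most. This assumes the points and their cluster ids have already been exported from Mahout into plain Python lists; the function name, feature names, and data below are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def characteristic_features(rows, labels, feature_names, top_n=2):
    """For each cluster, rank features by how far the cluster mean lies
    from the global mean, measured in global standard deviations."""
    columns = list(zip(*rows))
    global_mean = [mean(c) for c in columns]
    global_std = [pstdev(c) for c in columns]

    # Group rows by their cluster id.
    members = defaultdict(list)
    for row, label in zip(rows, labels):
        members[label].append(row)

    report = {}
    for label, cluster_rows in members.items():
        scores = []
        for i, name in enumerate(feature_names):
            if global_std[i] == 0:
                continue  # a constant feature cannot distinguish clusters
            cluster_mean = mean(r[i] for r in cluster_rows)
            scores.append((name, (cluster_mean - global_mean[i]) / global_std[i]))
        # Largest absolute deviation first.
        scores.sort(key=lambda s: abs(s[1]), reverse=True)
        report[label] = scores[:top_n]
    return report


# Made-up example: three features, two clusters.
rows = [(1.0, 10.0, 5.0), (1.2, 11.0, 5.0), (9.0, 10.5, 5.0), (9.5, 10.0, 5.0)]
labels = [0, 0, 1, 1]
report = characteristic_features(rows, labels, ["age", "income", "constant"])
```

A large positive or negative score means the cluster sits well above or below the overall average on that feature. This describes the clusters after the fact; it does not explain the algorithm's decisions.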

And if your data set is small enough to be processed on a single system, I would not use Mahout in the first place. It only makes sense for really huge data sets. The Hadoop machinery adds overhead that you don't need in a single-computer setting.

Upvotes: 2
