benj rei

Reputation: 329

ELKI get clustering data points

How do I get the data points and the centroid of a k-means (Lloyd) cluster when I use ELKI?

Also, could I plug those points into one of the distance functions to get the distance between any two of them?

This question is different from the other thread, because my main focus is retrieving the data points of a cluster, not supplying custom data points. The answer on the other thread is also currently incomplete, since it refers to a wiki that is not functioning at the moment. Additionally, I would like to know specifically what needs to be done. The documentation of these libraries is a bit of a wild goose chase, so if you know the library, a direct answer would be greatly appreciated. It would also give others with the same problem a solid reference, instead of having to figure out the library on their own.

Upvotes: 1

Views: 614

Answers (1)

Erich Schubert

Reputation: 8715

A Cluster (JavaDoc) in ELKI never stores the point data. It only stores point DBIDs (Wiki), which you can get using the getIDs() method. To get the original data, you need the Relation from your database. The method getModel() returns the cluster model, which for kmeans is a KMeansModel.

You can get the point data from the database Relation by their DBID, or compute the distance based on two DBIDs.
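To make the indirection concrete, here is a minimal plain-Java sketch of the same idea. It does not use the ELKI API: integer ids stand in for DBIDs, a Map stands in for the Relation, and the class and method names (ClusterLookupSketch, resolve, distanceByIds) are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in, NOT the ELKI API:
// integer ids play the role of DBIDs, the Map plays the role of the Relation.
class ClusterLookupSketch {

    // Resolve the ids stored in a "cluster" to the vectors kept in the "relation".
    static double[][] resolve(List<Integer> memberIds, Map<Integer, double[]> relation) {
        double[][] points = new double[memberIds.size()][];
        for (int i = 0; i < memberIds.size(); i++) {
            points[i] = relation.get(memberIds.get(i));
        }
        return points;
    }

    // Squared Euclidean distance between two vectors of equal length.
    static double squaredEuclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Distance between two objects addressed only by their ids.
    static double distanceByIds(int id1, int id2, Map<Integer, double[]> relation) {
        return squaredEuclidean(relation.get(id1), relation.get(id2));
    }

    public static void main(String[] args) {
        Map<Integer, double[]> relation = new HashMap<>();
        relation.put(7, new double[] { 0.0, 0.0 });
        relation.put(9, new double[] { 3.0, 4.0 });

        // The "cluster" only knows ids; the data comes from the relation.
        List<Integer> clusterIds = Arrays.asList(7, 9);
        for (double[] point : resolve(clusterIds, relation)) {
            System.out.println(Arrays.toString(point));
        }
        System.out.println(distanceByIds(7, 9, relation));
    }
}
```

In ELKI itself, the lookup corresponds to calling get on the Relation with a DBID taken from the cluster.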

The centroid of KMeans is special - it is not a database object, but always a numerical vector - the arithmetic mean of the cluster. When using KMeans, you should be using SquaredEuclideanDistanceFunction. This is a NumberVectorDistanceFunction, which has the method distance(NumberVector o1, NumberVector o2) (not all distances work on number vectors!).

Relation<? extends NumberVector> relation = ...;
NumberVectorDistanceFunction<? super NumberVector> df = SquaredEuclideanDistanceFunction.STATIC;

... run the algorithm, then iterate over each cluster: ...

Cluster<KMeansModel> cluster = ...;
Vector center = cluster.getModel().getMean(); 
double varsum = cluster.getModel().getVarianceContribution();

double sum = 0.;
// C++-style for loop, for efficiency:
for(DBIDIter id = cluster.getIDs().iter(); id.valid(); id.advance()) {
   double distance = df.distance(relation.get(id), center);
   sum += distance;
}

System.out.println(varsum+" should be the same as "+sum);
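The quantity being summed above can be illustrated without ELKI. The following is a small plain-Java sketch, assuming the points of one cluster are already available as double arrays; the class and method names (CentroidSumSketch, mean, sumOfSquaredDistances) are hypothetical and not part of the library.

```java
import java.util.Arrays;

// Plain-Java illustration, NOT the ELKI API: the centroid is the arithmetic
// mean of the cluster, and the sum is taken over squared Euclidean distances.
class CentroidSumSketch {

    // Arithmetic mean of the points, computed per dimension.
    static double[] mean(double[][] points) {
        double[] m = new double[points[0].length];
        for (double[] p : points) {
            for (int d = 0; d < m.length; d++) {
                m[d] += p[d];
            }
        }
        for (int d = 0; d < m.length; d++) {
            m[d] /= points.length;
        }
        return m;
    }

    // Squared Euclidean distance between two vectors of equal length.
    static double squaredEuclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Sum of squared distances from every point to the given center.
    static double sumOfSquaredDistances(double[][] points, double[] center) {
        double sum = 0.0;
        for (double[] p : points) {
            sum += squaredEuclidean(p, center);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] cluster = { { 1.0, 2.0 }, { 3.0, 4.0 }, { 5.0, 0.0 } };
        double[] center = mean(cluster);
        System.out.println("center = " + Arrays.toString(center));
        System.out.println("sum of squared distances = " + sumOfSquaredDistances(cluster, center));
    }
}
```

For the three sample points, the mean is (3, 2) and the squared distances are 4, 4 and 8, so the sum is 16.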

Upvotes: 2
