DaveK

Reputation: 590

How to run predict() on "precomputed" data for clustering in python

I have my own precomputed similarity data for running AP or k-means in Python. I would like to do a train/test split on the data to see whether the clusterings agree well with the classes, but when I call predict(), Python tells me that predict() is not available for "precomputed" data.
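A minimal sketch of what happens (toy data; the exact error message may vary by sklearn version):

    import numpy as np
    from sklearn.cluster import AffinityPropagation

    # Build a precomputed similarity matrix (negative squared Euclidean
    # distances, the form AP expects) for some toy points.
    X = np.random.rand(20, 2)
    S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

    ap = AffinityPropagation(affinity="precomputed").fit(S)
    ap.predict(S)  # ValueError: predict is not supported when affinity="precomputed"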

Is there another way to run a train / test on clustered data in python?

Upvotes: 2

Views: 838

Answers (1)

Has QUIT--Anony-Mousse

Reputation: 77485

Most clustering algorithms, including AP, have no well-defined way to "predict" on new data. K-means is one of the few cases simple enough to allow a "prediction" consistent with the initial clusters.
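For illustration, a small sketch (toy data) of the k-means case: predict is nothing more than nearest-center assignment, so it stays consistent with the fitted clusters.

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(50, 2)
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    # "Prediction" is just the nearest fitted center:
    X_new = np.random.rand(5, 2)
    dists = np.linalg.norm(X_new[:, None, :] - km.cluster_centers_[None, :, :], axis=-1)
    assert np.array_equal(km.predict(X_new), dists.argmin(axis=1))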

Now sklearn has this oddity of trying to squeeze everything into a supervised API. Clustering algorithms have a fit(X, y) method that ignores y, and they are supposed to have a predict method even though most of the algorithms have no such capability.

For affinity propagation, someone at some point decided to add a predict based on k-means: it always predicts the nearest center. Computing the mean is only possible with coordinate data, and hence the method fails with affinity="precomputed". If you want to replicate this behavior, compute the distances to all cluster centers and choose the argmin, that's all. You can't fit this into the sklearn API easily with "precomputed" metrics. You could require the user to pass a distance vector to all "training" examples for the precomputed metric, but only a few of them (the exemplars) are needed...
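A hedged sketch of that replication (predict_precomputed is a hypothetical helper, not part of sklearn; it assumes you can compute distances from the new points to all training points):

    import numpy as np

    # D_new: (n_new, n_train) distance matrix from new points to training points.
    # ap.cluster_centers_indices_ holds the training indices of the exemplars;
    # only those columns of D_new are actually needed.
    def predict_precomputed(ap, D_new):
        return np.argmin(D_new[:, ap.cluster_centers_indices_], axis=1)

The result should index the clusters in the same order as ap.labels_, since sklearn numbers AP clusters by exemplar order.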

In my opinion, I'd rather remove this method altogether:

  1. It does not appear in any published research on affinity propagation that I know of
  2. Affinity propagation is based on concepts of similarity ("affinity") not on distance or means
  3. This predict will not return the same results that AP itself assigned to the points, because AP labels points using a "propagated responsibility" rather than the nearest "center". (The current sklearn implementation may be losing this information...)
  4. Clustering methods don't have a consistent predict anyway - it's not a requirement to have this.
  5. If you want to do this kind of prediction, just pass the cluster centers to a nearest neighbor classifier. That is what is being re-implemented here: a hidden NN classifier. So you get more flexibility if you make prediction an explicit second (classification) step, as sketched after this list.
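A minimal sketch of that two-step approach (toy data, k-means for simplicity; the same pattern works with any clustering whose clusters you represent by centers):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.neighbors import KNeighborsClassifier

    X = np.random.rand(100, 2)
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    # Label each center with its own cluster id, then classify new points
    # by their nearest center - but now metric, k, and weights are yours to change.
    nn = KNeighborsClassifier(n_neighbors=1).fit(km.cluster_centers_, np.arange(3))
    labels = nn.predict(np.random.rand(10, 2))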

Note that in clustering it is not common to do any train/test split, because you don't use the labels anyway, and you can only use unsupervised evaluation methods, if any at all (these have their own array of issues). You cannot reliably do "hyperparameter optimization" here; you have to choose parameters based on experience and on humans looking at the data.
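If you evaluate at all, it is on the full data with an internal measure, for example (toy data; silhouette is just one such measure and carries its own biases):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    X = np.random.rand(100, 2)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(silhouette_score(X, labels))  # internal measure; interpret with care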

Upvotes: 5
