Reputation: 590
I have my own precomputed data for running AP or Kmeans in python. However when I go to run predict() as I would like to run a train() and test() on the data to see if the clusterings have a good accuracy on the class or clusters, Python tells me that predict() is not available for "precomputed" data.
Is there another way to run a train / test on clustered data in python?
Upvotes: 2
Views: 838
Reputation: 77485
Most clustering algorithms, including AP, have no well-defined way to "predict" on new data. K-means is one of the few cases simple enough to allow a "prediction" consistent with the initial clusters.
Now sklearn has this oddity of trying to squeeze everything into a supervised API. Clustering algorithms have a fit(X, y)
method, but ignore y
, and are supposed to have a predict
method even though the algorithms don't have such a capability.
For affinity propagation, someone at some point decided to add a predict
based on k-means: It always predicts the nearest center. Computing the mean only is possible with coordinate data, and hence the method fails with metric=precomputed.
If you want to replicate this behavior, computer the distances to all cluster centers, and choose the argmin, that's all. You can't fit this into the sklearn API easily with "precomputed" metrics. You could require the user to pass a distance vector to all "training" examples for the precomputed metric, but only few of them are needed...
In my opinion, I'd rather remove this method altogether:
predict
will not return the same results as the points were labeled by AP, because AP is labeling points using a "propagated responsibility", rather than the nearest "center". (The current sklearn implementation may be losing this information...)predict
anyway - it's not a requirement to have this.Note that it clustering it is not common to do any test-train split, because you don't use the labels anyway, and use only unsupervised evaluation methods (if any at all, because these have their own array of issues) if any at all - you cannot reliably do "hyperparameter optimization" here, but have to choose parameters based on experience and humans looking at the data.
Upvotes: 5