Reputation: 39
I am a bit confused about clustering, e.g. K-means clustering. I have already created clusters from the training data, and in the testing part I want to know whether new points belong to one of the existing clusters or not. My idea is to find the center of each cluster and the training point farthest from it; then, in the testing part, if the distance from a new point to the center is greater than a threshold (e.g. 1.5x the distance of the farthest point), the point cannot be in the cluster!
Is this idea efficient and correct, and is there any Python function to do this?
One more question: could someone help me understand the difference between kmeans.fit() and kmeans.predict()? I get the same result from both functions!
I appreciate any help
Upvotes: 2
Views: 3906
Reputation: 4172
In general, when you fit the K-means algorithm, you get the cluster centers as a result.
So, if you want to test which cluster a new point belongs to, you must calculate the distance from the point to each cluster center and give the point the label of the closest center.
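The nearest-center rule described above can be sketched in plain Python. This is only an illustration: the function name nearest_center and the example centers are made up here, and Euclidean distance is assumed (it is what K-means normally uses).

```python
import math

def nearest_center(point, centers):
    """Return (index, distance) of the center closest to `point`."""
    # Euclidean distance from the point to every center.
    distances = [math.dist(point, c) for c in centers]
    # Index of the smallest distance is the cluster label.
    best = min(range(len(centers)), key=distances.__getitem__)
    return best, distances[best]

# Hypothetical centers, as if taken from an earlier K-means fit.
centers = [(0.0, 0.0), (10.0, 10.0)]

label, dist = nearest_center((1.0, 1.0), centers)
print(label, dist)
```

The returned distance is also what a threshold check like the one proposed in the question would be compared against.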
If you are using the scikit-learn library:
predict(X) predicts the closest cluster each sample in X belongs to.
fit(X) fits the data, or in other words, calculates the cluster centers.
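A minimal sketch of how the two methods fit together, and of the threshold idea from the question, might look like this. The toy data, the helper predict_with_rejection, and the use of -1 for "belongs to no cluster" are assumptions made for illustration; scikit-learn does not provide such a rejection rule itself, as far as I know. The factor 1.5 is the one suggested in the question.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy training data: two well-separated groups (made up for illustration).
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# fit() learns the cluster centers from the training data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)

# transform() gives the distance from each sample to every center.
train_dist = kmeans.transform(X_train)          # shape (n_samples, n_clusters)
train_labels = kmeans.labels_                   # labels assigned during fit()
own_dist = train_dist[np.arange(len(X_train)), train_labels]

# "Radius" of each cluster: distance to its farthest training point.
radii = np.array([own_dist[train_labels == k].max()
                  for k in range(kmeans.n_clusters)])

def predict_with_rejection(model, radii, X_new, factor=1.5):
    """Hypothetical helper: nearest-center label, or -1 if too far away."""
    labels = model.predict(X_new)               # closest center per sample
    dist = model.transform(X_new)[np.arange(len(X_new)), labels]
    return np.where(dist <= factor * radii[labels], labels, -1)

X_new = np.array([[0.5, 0.5], [10.5, 10.5], [50.0, 50.0]])
result = predict_with_rejection(kmeans, radii, X_new)
print(result)   # the last point is far from both centers, so it gets -1
```

This may also explain the "same result" in the question: calling predict() on the same data that was passed to fit() normally reproduces kmeans.labels_, because those training points are simply assigned to their nearest learned center again. The two methods only give different information once predict() is called on new data.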
Here is a nice example of how to use K-means in scikit-learn
Upvotes: 1