Reputation: 125
I have been using sklearn's KMeans implementation to cluster a labeled dataset, and I have been using sklearn's clustering metrics to test the clustering performance.
As you know, sklearn's KMeans output is a list of integers in the range of k_clusters. However, my labels are strings.
So far I have had no problems with this, since the metrics from sklearn.metrics.cluster work with mixed inputs (int and str label lists).
However, now I want to use some of the classification metrics, and from what I gather the inputs k_true and k_pred need to come from the same set: either integers in the range of k, or the string labels that my dataset is using. If I mix them, I get the following error:
AttributeError: 'bool' object has no attribute 'sum'
So, how could I translate the k_means labels into another type of label, or do it the other way around (string labels -> integer labels)? How would I even begin implementing it? Since k_means is non-deterministic, the cluster labels might change from run to run. Is there a legitimate way to translate KMeans labels correctly?
EDIT:
EXAMPLE
for k = 4
kmeans output: [0,3,3,2,........0]
class labels : ['CAT','DOG','DOG','BIRD',.......'CHICKEN']
Upvotes: 2
Views: 3398
Reputation: 8801
You can create a mapping using a dictionary, say
mapping_dict = { 0: 'cat', 1: 'chicken', 2:'bird', 3:'dog'}
Then you can apply this mapping using, say, a list comprehension. Suppose your predictions are stored in a list kmeans_predictions:
mapped_predictions = [ mapping_dict[x] for x in kmeans_predictions]
Then use mapped_predictions as your predictions.
Update: Based on your comments, I believe you have to do it the other way round, i.e. convert your string labels into int mappings (see the sketch below).
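As a minimal sketch, assuming your ground-truth strings are in a hypothetical list called class_labels, sklearn's LabelEncoder can do this conversion:

from sklearn.preprocessing import LabelEncoder

# Assumption: class_labels holds the string labels, e.g. ['CAT', 'DOG', 'DOG', 'BIRD', ...]
encoder = LabelEncoder()
# fit_transform assigns each distinct string an integer (sorted alphabetically: BIRD -> 0, CAT -> 1, ...)
int_labels = encoder.fit_transform(class_labels)
# encoder.classes_ stores the string for each integer; encoder.inverse_transform goes back to strings

Note that these integers are assigned alphabetically and will not, in general, coincide with the KMeans cluster numbers; aligning the two is a separate problem.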
Also, you cannot use just any classification metric here. Use the completeness score, V-measure, and homogeneity score, as these are better suited to clustering problems. It would be incorrect to blindly apply an arbitrary classification metric.
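For example, with the hypothetical lists class_labels (strings) and kmeans_predictions (integers) from above, these metrics accept mixed label types directly:

from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

# Each function takes (labels_true, labels_pred); the two label sets do not need to match
print(homogeneity_score(class_labels, kmeans_predictions))
print(completeness_score(class_labels, kmeans_predictions))
print(v_measure_score(class_labels, kmeans_predictions))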
Upvotes: 1
Reputation: 77454
Clustering is not classification.
The methods do not predict a label, so you must not use a classification evaluation measure. That would be like measuring the quality of an apple in miles per gallon...
If you insist on doing the wrong thing (tm), then use the Hungarian algorithm to find the best mapping. But beware: the number of clusters and the number of classes will usually not be the same. If that is the case, such a mapping will be either unfairly negative (extra clusters are left unmapped) or unfairly positive (mapping multiple clusters to the same label would make the trivial "N points as N clusters" solution look optimal). It's better to use only clustering measures.
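If you still want to try it, here is a rough sketch of that mapping using scipy's linear_sum_assignment on the contingency table, again assuming hypothetical class_labels and kmeans_predictions lists:

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics.cluster import contingency_matrix

# Rows are true classes, columns are clusters; entry [i, j] counts points of class i in cluster j
cont = contingency_matrix(class_labels, kmeans_predictions)

# The Hungarian algorithm minimizes cost, so negate the counts to maximize the matched points
row_ind, col_ind = linear_sum_assignment(-cont)

# Map each matched cluster id to its best-matching class label
classes = np.unique(class_labels)          # row order of the contingency table
cluster_ids = np.unique(kmeans_predictions)  # column order of the contingency table
cluster_to_class = {cluster_ids[c]: classes[r] for r, c in zip(row_ind, col_ind)}
# Caveat: when the number of clusters differs from the number of classes, some clusters
# or classes are left unmatched, which is exactly the unfairness described above.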
Upvotes: 1