Markos di Mitsas

Reputation: 125

How to correctly translate Kmeans labels to category labels

I have been using sklearn's KMeans implementation.

I have been clustering a dataset which is labeled, and I have been using sklearn's clustering metrics in order to test the clustering performance.

As you know, the output of sklearn's KMeans is a list of integers in the range of k_clusters. My labels, however, are strings.

So far I have had no problems with this, since the metrics from sklearn.metrics.cluster work with mixed inputs (int and str label lists).

However, I now want to use some of the classification metrics, and from what I gather, the inputs k_true and k_pred need to come from the same label set: either integers in the range of k, or the string labels that my dataset uses. If I pass mixed inputs, I get the following error:

AttributeError: 'bool' object has no attribute 'sum'

So, how could I translate the k-means labels into another type of label? Or even the other way around (string labels -> integer labels)?

How could I even begin implementing it? Since k-means is pretty non-deterministic, the labels might change from run to run. Is there a legitimate way to correctly translate KMeans labels?

EDIT:

EXAMPLE

for k = 4

kmeans output: [0,3,3,2,........0]

class labels : ['CAT','DOG','DOG','BIRD',.......'CHICKEN']

Upvotes: 2

Views: 3398

Answers (2)

Gambit1614

Reputation: 8801

You can create a mapping using a dictionary, say:

mapping_dict = { 0: 'cat', 1: 'chicken', 2:'bird', 3:'dog'}

Then you can simply apply this mapping using, say, a list comprehension. Suppose your labels are stored in a list kmeans_predictions:

mapped_predictions = [ mapping_dict[x] for x in kmeans_predictions]

Then use mapped_predictions as your predictions.
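Since the cluster ids are arbitrary and can change between runs, a hard-coded dictionary will not stay valid. One possible way to derive it from the data is a majority vote per cluster. This is only a sketch: the helper name majority_vote_mapping and the toy data are made up for illustration, and nothing stops several clusters from ending up with the same label.

```python
from collections import Counter, defaultdict

def majority_vote_mapping(cluster_ids, true_labels):
    """Map each cluster id to the most common true label among its members."""
    votes = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, true_labels):
        votes[cluster_id][label] += 1
    # most_common(1) returns [(label, count)]; keep only the label.
    return {cid: counts.most_common(1)[0][0] for cid, counts in votes.items()}

# Hypothetical toy data, not taken from the question's dataset.
kmeans_predictions = [0, 3, 3, 2, 0, 1, 1, 2, 2]
class_labels = ['CAT', 'DOG', 'DOG', 'BIRD', 'CAT',
                'CHICKEN', 'CHICKEN', 'BIRD', 'DOG']

mapping_dict = majority_vote_mapping(kmeans_predictions, class_labels)
mapped_predictions = [mapping_dict[x] for x in kmeans_predictions]
```

The mapping has to be recomputed after every fit, because the same group of points may receive a different cluster id next time.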

Update: Based on your comments, I believe you have to do it the other way round, i.e. convert your labels into int mappings.

Also, you cannot use just any classification metric here. Use completeness score, V-measure and homogeneity, as these are better suited to clustering problems. It would be incorrect to blindly apply an arbitrary classification metric here.
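A minimal sketch of the string-to-int direction, using plain dictionaries. The variable names are made up for illustration, and the codes are assigned in sorted order only so the encoding is reproducible.

```python
# Hypothetical class labels, following the example in the question.
class_labels = ['CAT', 'DOG', 'DOG', 'BIRD', 'CHICKEN']

# Assign integer codes in sorted order so the encoding is reproducible.
label_to_int = {label: i for i, label in enumerate(sorted(set(class_labels)))}
# Inverse lookup, to turn integer codes back into the original strings.
int_to_label = {i: label for label, i in label_to_int.items()}

encoded_labels = [label_to_int[label] for label in class_labels]
decoded_labels = [int_to_label[i] for i in encoded_labels]
```

Note that this only changes the type of the labels. The integer codes are still unrelated to the cluster ids that k-means assigns, so an alignment step is needed before the two lists can be compared element by element.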

Upvotes: 1

Has QUIT--Anony-Mousse

Reputation: 77454

Clustering is not classification.

The methods do not predict a label, so you must not use a classification evaluation measure. That would be like measuring the quality of an apple in miles per gallon...

If you insist on doing the wrong thing(tm), then use the Hungarian algorithm to find the best mapping. But beware: the number of clusters and the number of classes will usually not be the same. If this is the case, using such a mapping will either be unfairly negative (not mapping extra clusters) or unfairly positive (mapping multiple clusters to the same label will consider the "N points are N clusters" solution optimal). It's better to only use clustering measures.
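To make the "best mapping" idea concrete, here is a small sketch that finds the one-to-one assignment with the most agreeing points. It is a brute-force search over permutations, not the Hungarian algorithm itself, so it is only practical for a handful of clusters; SciPy's scipy.optimize.linear_sum_assignment solves the same assignment problem in polynomial time. The function name and toy data are made up, and the sketch assumes the number of clusters equals the number of classes, which, as noted above, is often not the case.

```python
from collections import Counter
from itertools import permutations

def best_one_to_one_mapping(cluster_ids, true_labels):
    """Brute-force the cluster -> label assignment that maximises agreement.

    Assumes equally many clusters and classes; visits k! permutations.
    """
    clusters = sorted(set(cluster_ids))
    labels = sorted(set(true_labels))
    if len(clusters) != len(labels):
        raise ValueError("need equally many clusters and classes")
    # overlap[(c, l)] = number of points in cluster c whose true label is l.
    overlap = Counter(zip(cluster_ids, true_labels))

    def agreement(perm):
        return sum(overlap[(c, l)] for c, l in zip(clusters, perm))

    best = max(permutations(labels), key=agreement)
    return dict(zip(clusters, best))

# Hypothetical toy data, not taken from the question's dataset.
kmeans_predictions = [0, 3, 3, 2, 0, 1, 1, 2]
class_labels = ['CAT', 'DOG', 'DOG', 'BIRD', 'CAT', 'CHICKEN', 'CHICKEN', 'DOG']

mapping = best_one_to_one_mapping(kmeans_predictions, class_labels)
matched = sum(mapping[c] == l for c, l in zip(kmeans_predictions, class_labels))
```

Any score computed after such a mapping is optimistic by construction, since the mapping was chosen using the true labels.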

Upvotes: 1
