Reputation: 748
I'm trying to learn sklearn. As I understand from step 5 of the following example, the predicted cluster labels may not line up with the true labels, and it is up to me to relabel them properly. The same is done in an example in the scikit-learn documentation: labels must be reassigned so that the results of the clustering and the ground truth match by color.
How would I know whether the labels of the predicted clusters match the initial data labels, and how do I remap the label indices so that the two sets match?
Upvotes: 4
Views: 4348
Reputation: 784
With clustering, there's no meaningful order or comparison between clusters; we're just finding groups of observations that have something in common. There's no reason to refer to one cluster as 'the blue cluster' vs 'the red cluster' (unless you have some extra knowledge about the domain). For that reason, sklearn assigns an arbitrary number to each cluster.
print(clustering.labels_)
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 2 2 2 0 2 2 2 2 2 2 2 2 0 2 2 2 2 0 2 2 2
2 0 0 0 2 2 2 2 2 2 2 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 0 0 0 0 2 0 0 0 0
0 0 2 2 0 0 0 0 2 0 2 0 2 0 0 2 0 0 0 0 0 0 2 2 0 0 0 2 0 0 0 2 0 0 0 2 0
0 2]
The labels could have just as easily replaced all of the 1s with 0s and 0s with 1s and it would still be the same set of clusters.
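To make that point concrete, here is a minimal sketch (the helper name `same_partition` and the toy arrays are made up for illustration, not part of sklearn). It compares two labelings by asking, for every pair of points, whether they fall in the same cluster; the actual label numbers never enter the comparison.

```python
import numpy as np

def same_partition(a, b):
    """Return True if labelings a and b group the points identically,
    regardless of which number each group was given (hypothetical helper)."""
    a = np.asarray(a)
    b = np.asarray(b)
    # Pairwise "are points i and j in the same cluster?" matrices
    co_a = a[:, None] == a[None, :]
    co_b = b[:, None] == b[None, :]
    return np.array_equal(co_a, co_b)

labels = np.array([1, 1, 0, 2, 2, 0])
swapped = np.array([0, 0, 1, 2, 2, 1])    # 0s and 1s exchanged
different = np.array([0, 1, 1, 2, 2, 0])  # a genuinely different grouping

print(same_partition(labels, swapped))    # True
print(same_partition(labels, different))  # False
```

The pairwise matrix is quadratic in the number of points, so this is only meant for small illustrations.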
In this case, the numbering doesn't match the numbering used in the ground truth, so the colors don't line up when we compare the generated clusters with the ground truth. We therefore reassign the labels using np.choose, as shown in the example:
relabel = np.choose(clustering.labels_,[2,0,1]).astype(np.int64)
This takes the current labels and changes 0 to 2 (because the entry at index 0 of the list is 2), 1 to 0, and 2 to 1. It's the same set of clusters, but we changed the (arbitrary) labeling to match up.
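Here is a small sketch of that remapping on a toy array (the `labels` values are invented for illustration; in the example they would be `clustering.labels_`). It also shows plain NumPy integer indexing, which should give the same result as np.choose for this kind of lookup.

```python
import numpy as np

# Toy stand-in for clustering.labels_ (values invented for illustration)
labels = np.array([1, 1, 0, 2, 2, 0])

# mapping[old_label] gives the new label: 0 -> 2, 1 -> 0, 2 -> 1
mapping = [2, 0, 1]

relabel = np.choose(labels, mapping).astype(np.int64)
print(relabel)  # [0 0 2 1 1 2]

# Equivalent lookup using integer (fancy) indexing
relabel_alt = np.asarray(mapping)[labels]
print(np.array_equal(relabel, relabel_alt))  # True
```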
To answer your question about how to know when they do or don't match: clustering is a form of unsupervised learning, which means you usually won't have the ground truth at all and there's nothing that you need to worry about matching against. In this example, we knew the ground truth and we could see that the clusters didn't match up side by side, so we can choose to make the colors match if we want. We can also choose to not do so, since they're the same clusters anyway, but you may find it easier to visualize this way.
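If you do have the ground truth and would rather not work out the mapping by eye, one possible approach (my own sketch, not what the sklearn example does) is to try every permutation of the cluster numbers and keep the one that agrees with the true labels most often. The function name `best_relabeling` and the toy arrays are hypothetical. Brute force is fine for a handful of clusters; for many clusters, an assignment solver such as `scipy.optimize.linear_sum_assignment` applied to the confusion matrix would be the usual choice.

```python
from itertools import permutations

import numpy as np

def best_relabeling(pred, truth, n_clusters):
    """Try every permutation of cluster numbers and return the mapping
    (as a list where mapping[old] = new) that matches truth most often,
    together with the number of matching points (hypothetical helper)."""
    pred = np.asarray(pred)
    truth = np.asarray(truth)
    best_map, best_hits = None, -1
    for perm in permutations(range(n_clusters)):
        mapped = np.asarray(perm)[pred]         # apply candidate mapping
        hits = int(np.sum(mapped == truth))     # points that now agree
        if hits > best_hits:
            best_map, best_hits = list(perm), hits
    return best_map, best_hits

# Toy data, invented for illustration
truth = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
pred = np.array([1, 1, 1, 0, 0, 2, 2, 2, 0])

mapping, hits = best_relabeling(pred, truth, n_clusters=3)
print(mapping, hits)  # [1, 0, 2] 7

# The mapping can be passed straight to np.choose, as in the answer above
relabeled = np.choose(pred, mapping)
```

Two points still disagree after relabeling in this toy case; those are real clustering errors, not numbering mismatches.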
Upvotes: 4