titusAdam

Reputation: 809

Centroids are not centered at clusters

I'm trying to plot the centroids and clusters of my k-means analysis, using the following code:

matrix_reduced = TruncatedSVD(n_components = num_k).fit_transform(matrix)

matrix_embedded = TSNE(n_components=2, perplexity=30,verbose=2, n_iter =500).fit_transform(matrix_reduced)

centroids = kmeans.cluster_centers_
centroids_embedded = TSNE(n_components=2).fit_transform(order_centroids)


fig = plt.figure(figsize=(10,10))
ax1 = fig.add_subplot(111)


ax1.scatter(matrix_embedded[:,0], matrix_embedded[:,1],marker='x',c = kmeans.labels_)
ax1.scatter(centroids_embedded[:,0], centroids_embedded[:,1],marker='o',c = 'red')

plt.show()

Unfortunately, the centroids are not centered at the different clusters:

[image: the resulting scatter plot]

Question: Does anyone know what could cause this? I have no idea what's going wrong.

Thanks!

Upvotes: 0

Views: 894

Answers (1)

carrdelling

Reputation: 1725

In general, when creating any manifold you need to provide all the points you want to represent on it, because the final representation usually depends on all the points in your data.

In the example, you are creating two different manifolds:

matrix_reduced = TruncatedSVD(n_components = num_k).fit_transform(matrix)

# first manifold
matrix_embedded = TSNE(n_components=2, perplexity=30,verbose=2, n_iter =500).fit_transform(matrix_reduced)

centroids = kmeans.cluster_centers_
# second manifold
centroids_embedded = TSNE(n_components=2).fit_transform(order_centroids)

This means that the two representations are independent, which is why you don't see the centroids centered: they are, in fact, in a different space.

The fix is to join matrix_reduced and order_centroids into a single dataset (for example with numpy.vstack) and apply TSNE only once. That should show the result you are expecting.

Also, note that if you are using k-means on the original matrix (instead of on matrix_reduced) then the result will still be incorrect - you need to apply the same transformations to both your centroids and the data that k-means saw originally.
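A minimal sketch of that last point, assuming k-means was fitted on the original matrix: the centroids then live in the original feature space, so they have to go through the same fitted TruncatedSVD object (via transform, not a fresh fit_transform) before they can sit alongside matrix_reduced. The random matrix, the sizes, and the names svd and centroids_reduced are placeholders for illustration, not part of the question.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
matrix = rng.random((100, 20))  # placeholder for the real data

# k-means fitted on the ORIGINAL matrix, so each centroid has 20 features
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(matrix)

# Fit the SVD once on the data and keep the fitted object around
svd = TruncatedSVD(n_components=5, random_state=0)
matrix_reduced = svd.fit_transform(matrix)

# Reuse the already-fitted SVD to project the centroids into the same space
centroids_reduced = svd.transform(kmeans.cluster_centers_)

print(matrix_reduced.shape, centroids_reduced.shape)
```

If k-means was fitted on matrix_reduced instead, this extra projection is not needed, because the centroids are already in the reduced space.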

So in summary (and assuming you want to use TruncatedSVD before the clustering), it would work as follows:

  1. Read the dataset
  2. Apply TruncatedSVD to transform the whole dataset at once.
  3. Use k-means on the transformed dataset to get k centroids
  4. Get the centroids and concatenate them at the end of your dataset (as if they were additional examples)
  5. Apply TSNE to the whole dataset.
  6. (optional) Draw all rows except the last k (your original data points), as usual.
  7. (optional) Draw the last k rows (your transformed centroids) in a different color.
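The steps above could be sketched roughly as follows. This is only an illustration under assumptions: the random matrix stands in for the real dataset, and the sizes, k, perplexity value, and the names combined, points_embedded and centroids_embedded are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# 1. Placeholder dataset (stand-in for the real matrix)
rng = np.random.default_rng(0)
matrix = rng.random((60, 20))
k = 3

# 2. Reduce the whole dataset at once
matrix_reduced = TruncatedSVD(n_components=5, random_state=0).fit_transform(matrix)

# 3. Cluster in the reduced space
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(matrix_reduced)

# 4. Append the centroids as k extra rows at the end of the data
combined = np.vstack([matrix_reduced, kmeans.cluster_centers_])

# 5. Run t-SNE a single time on data and centroids together
#    (perplexity must be smaller than the number of rows)
embedded = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(combined)

# 6-7. Split the embedding back into data points and centroids
points_embedded = embedded[:-k]
centroids_embedded = embedded[-k:]
```

The two arrays can then go into the scatter calls from the question: points_embedded with c = kmeans.labels_, and centroids_embedded in red. Because everything shares one embedding, the centroids should tend to appear near their clusters, although t-SNE does not guarantee that they land exactly in the middle.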

Upvotes: 1
