Reputation: 3616
I tried to implement the k-means algorithm for the MNIST data set. But since the result is far from good, there might be a bug (or several) I don't see at the moment. The code is really straightforward. Here is what I did so far:
import numpy as np
# Load images
I = np.load("mnist_test_images.npy").astype(float) # (10000,784)
L = np.load("mnist_test_labels.npy").astype(int) # (10000,1)
# Scale
I = 2.0*(I/255.0-0.5)
images = len(I)
# Random initialization of centers for k=10 clusters
M = np.random.randn(10,28*28)
guess = np.zeros((len(I),1))
step = 0
while (True):
# Compute distance of every image i to the center of every cluster k
# image i belongs to cluster with smallest distance
for i in range(images):
d = np.sum((M-I[i])**2,axis=1)
guess[i] = np.argmin(d)
# Update the centers for all clusters
# New center is the mean of all images i which belong to cluster k
for k in range(10):
idx, _ = np.where(guess == k)
if len(idx) > 0:
M[k] = np.mean(I[idx],axis=0)
# Test how good the algorithm works
# Very similar to first step
if (step % 10 == 0):
fitness = 0
for i in range(images):
dist = np.sum((M-I[i])**2,axis=1)
if L[i] == np.argmin(dist):
fitness += 1
print("%d" % fitness, flush=True)
step += 1
The code looks really simple. But there is probably a bug somewhere. When I test it, the accuracy drops from about 10-20% down to 5-10% or converges almost instantly not reaching more than 30%. I can not recognize any learning. Could the random initialization of the cluster's centers cause this behavior?
Thank you!
Upvotes: 0
Views: 381
Reputation: 77454
The problem is that you treat this as a supervised learning approach, but it is unsupervised. In my opinion the whole "unsupervised learning" terminology should be avoided because it can be very misleading. In fact, I wouldn't call most "unsupervised" methods to be "learning" at all.
Clustering is not just "unsupervised classification". It is a very different and much much harder task. The task is so difficult that we do not even yet know how to really evaluate it.
I'm your case there are sevral issues:
Upvotes: 4