do.ob
do.ob

Reputation: 3

Implementing k-means algorithm correctly

I've just started to learn to code and had a go at writing the standard k-means algorithm. I tried my implementation on a data set generated with three different Gaussians and it seems to work well. However I tried it on the iris data set and every now and then (around third of the time) my function returns only two sets, in other words it returns only two clusters.

I had a peek at the code for the local MATLAB kmeans function but I ended up getting lost due to my lack of coding knowledge. I would really appreciate any help!

function [R,C,P,it] = mykmeans(X,K)
% X -- data matrix
% K -- number of clusters
% C -- partition sets
% P -- matrix of prototypes
% R -- binary indicator matrix: R(i,j) specifies whether the ith data is
% classified into jth cluster
% it -- number of iterations until convergence

% N points with M dimensions
[N,M] = size(X) ;

%% Initialisation

% At this step we randomly partition the data matrix into K equally sized
% matrices and compute the centre of each of these matrices.
% I -- randomised index vector
% v -- number of data points assigned to each cluster
% U -- randomly partitioned matrices

v = N/K ;
C = cell(K,1) ;
U = cell(K,1) ;
I = randperm(N) ;
oldR = zeros(N,K) ;

% C{1} = X(I(1:v),:) ;
% U{1} = mean(X(I(1:v),:)) ;
for k=1:K
    C{k} = X(I(1+v*(k-1):k*v),:) ;
    U{k} = mean(C{k}) ;
end

P = cell2mat(U) ;

converged = 0 ;
it = 0 ;
while converged ~= 1

    %% Assignment step

    % Each element of D{n} contains squared euclidean distance of nth data 
    % point from the kth prototype
    D = cell(N,1) ;
    R = zeros(N,K) ;
    for n=1:N
        D{n} = sum((repmat(X(n,:),K,1) - P).^2,2) ;
        [~,k] = min(D{n}) ;
        R(n,k) = 1 ; 
    end

    %% Update step

    C = cell(K,1) ; % reset C
    for k=1:K
        for n=1:N
            P(k,:) = R(n,k)*X(n,:) + P(k,:) ; % compute numerator of mean vector
            if R(n,k) == 1
                C{k} = [C{k};X(n,:)] ;
            end
        end
    end

    P = P ./ (sum(R)') ; % divide by denominator of mean vectors to get prototypes

%% Check for convergence

    if sum(sum(R == oldR))==N*K || it == 100 % convergence criteria
        converged = 1 ;
    else
        oldR = R ;
        it = it+1 ;
    end
end %while

Upvotes: 0

Views: 888

Answers (1)

Nikolas Rieble
Nikolas Rieble

Reputation: 2621

The problem indeed seems not a coding problem but an issue of understanding k-means.

In fact during k-means, clusters can become empty. You will need to account in your code for this, as otherwise the number of clusters in your result can be smaller than k.

A possible solutions could be:

  • assigning a random data point as the new cluster center for the empty cluster
  • choosing the point most far from the biggest cluster as the new cluster center for the empty cluster

The general approach therefore goes such as:

  1. Initialize k cluster centers (for example: random)
  2. Assign all data points to the closest cluster center
  3. Recompute the cluster centers based on the assignment
  4. Check for empty clusters
  5. Repeat steps 2 - 4 until convergence (== cluster centers did not change in the last iteration)

A great illustration of the issue of empty clusters can be found here.

Upvotes: 3

Related Questions