G Gr

Reputation: 6080

Irregular plot of k-means clustering, outlier removal

Hi, I'm working on clustering network data from the 1999 DARPA data set. Unfortunately I'm not really getting clustered data compared to some of the literature, despite using the same techniques and methods.

My data comes out like this:

Matlab Figure 1

As you can see, it is not very clustered. This is due to a lot of outliers (noise) in the dataset. I have looked at some outlier removal techniques, but nothing I have tried so far really cleans the data. One of the methods I have tried:

%% When an outlier is considered to be more than three standard deviations away from the mean, determine the number of outliers in each column of the count matrix:

    mu = mean(data)
    sigma = std(data)
    [n,p] = size(data);
    % Create a matrix of mean values by replicating the mu vector for n rows
    MeanMat = repmat(mu,n,1);
    % Create a matrix of standard deviation values by replicating the sigma vector for n rows
    SigmaMat = repmat(sigma,n,1);
    % Create a matrix of zeros and ones, where ones indicate the location of outliers
    outliers = abs(data - MeanMat) > 3*SigmaMat;
    % Calculate the number of outliers in each column
    nout = sum(outliers) 
    % To remove an entire row of data containing the outlier
    data(any(outliers,2),:) = [];

In the first run it removed 48 rows from the 1000 normalized random rows that were selected from the full dataset.

This is the full script I used on the data:

    %% load data
        %# read the list of features
        fid = fopen('kddcup.names','rt');
        C = textscan(fid, '%s %s', 'Delimiter',':', 'HeaderLines',1);
        fclose(fid);

        %# determine type of features
        C{2} = regexprep(C{2}, '.$','');              %# remove "." at the end
        attribNom = [ismember(C{2},'symbolic');true]; %# nominal features

        %# build format string used to read/parse the actual data
        frmt = cell(1,numel(C{1}));
        frmt( ismember(C{2},'continuous') ) = {'%f'}; %# numeric features: read as number
        frmt( ismember(C{2},'symbolic') ) = {'%s'};   %# nominal features: read as string
        frmt = [frmt{:}];
        frmt = [frmt '%s'];                           %# add the class attribute

        %# read dataset
        fid = fopen('kddcup.data_10_percent_corrected','rt');
        C = textscan(fid, frmt, 'Delimiter',',');
        fclose(fid);

        %# convert nominal attributes to numeric
        ind = find(attribNom);
        G = cell(numel(ind),1);
        for i=1:numel(ind)
            [C{ind(i)},G{i}] = grp2idx( C{ind(i)} );
        end

        %# all numeric dataset
        fulldata = cell2mat(C);

%% dimensionality reduction 
columns = 6
[U,S,V]=svds(fulldata,columns);

%% randomly select dataset
rows = 1000;
columns = 6;

%# pick random rows
indX = randperm( size(fulldata,1) );
indX = indX(1:rows)';

%# pick random columns (indY must be initialised before indexing into it)
indY = randperm( size(U,2) );
indY = indY(1:columns);

%# filter data
data = U(indX,indY);

% apply min-max normalization over the whole matrix
maxData = max(max(data));
minData = min(min(data));
data = (data - minData) ./ (maxData - minData);

% output matching data
dataSample = fulldata(indX, :)

%% When an outlier is considered to be more than 2.5 standard deviations away from the mean, use the following syntax to determine the number of outliers in each column of the count matrix:

mu = mean(data)
sigma = std(data)
[n,p] = size(data);
% Create a matrix of mean values by replicating the mu vector for n rows
MeanMat = repmat(mu,n,1);
% Create a matrix of standard deviation values by replicating the sigma vector for n rows
SigmaMat = repmat(sigma,n,1);
% Create a matrix of zeros and ones, where ones indicate the location of outliers
outliers = abs(data - MeanMat) > 2.5*SigmaMat;
% Calculate the number of outliers in each column
nout = sum(outliers) 
% To remove an entire row of data containing the outlier
data(any(outliers,2),:) = [];

%% generate sample data
K = 6;
numObservations = size(data, 1);
dimensions = 3;

%% cluster
opts = statset('MaxIter', 100, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);

%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 5, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 100, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
grid on
view([90 0]);

%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);

These are two distinct clusters from the output:

[figure: two of the resulting clusters]

As you can see, the data looks cleaner and more clustered than the original. However, I still think a better method can be used.

For instance, looking at the overall clustering, there is still a lot of noise (outliers) in the dataset, as can be seen here:

[figure: overall clustering, outliers still visible]

I need the outlier rows put into a separate dataset for later classification (removed only from the clustering).
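
With the variables already in the script above, this can be done by saving the flagged rows before deleting them. A minimal sketch:

    flagged = any(outliers, 2);                    % rows containing at least one outlier value
    outlierData   = data(flagged, :);              % set aside for later classification
    outlierSample = dataSample(flagged, :);        % matching rows of the un-normalized sample
    data(flagged, :) = [];                         % remove them from the clustering input only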

Here is a link to the DARPA dataset. Please note that the 10% data set has had a significant reduction in columns: most of the columns, which have only 0s or 1s running throughout, have been removed (from 42 columns down to 6):

http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

EDIT

Columns kept in the dataset are:

src_bytes: continuous.
dst_bytes: continuous.
count: continuous.
srv_count: continuous.
dst_host_count: continuous.
dst_host_srv_count: continuous.

RE-EDIT:

Based on discussions with Anony-Mousse and his answer, there may be a way of reducing noise in the clustering by using K-medoids (http://en.wikipedia.org/wiki/K-medoids). I'm hoping there isn't much of a change to the code I currently have, but as of yet I do not know how to implement it to test whether this will significantly reduce the noise. So, provided someone can show me a working example, this will be accepted as an answer.
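
A minimal, untested sketch of what the swap might look like, assuming a MATLAB release whose Statistics Toolbox ships the kmedoids function (R2014b and later), which has a kmeans-like interface:

    opts = statset('MaxIter', 100, 'Display', 'iter');
    % medoids are actual data points, so they should be less sensitive to outliers
    % than the k-means centroids used in the script above
    [clustIDX, medoids, sumd] = kmedoids(data, K, 'Distance', 'euclidean', ...
        'Replicates', 3, 'Options', opts);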

Upvotes: 8

Views: 6362

Answers (2)

Rody Oldenhuis

Reputation: 38032

First things first: you're asking for a lot here. For future reference: try to break your problem up into smaller chunks and post several questions. This increases your chances of getting answers (and doesn't cost you 400 reputation!).

Luckily for you, I understand your predicament, and just love this sort of question!

Apart from this dataset's possible issues with k-means, this question is still generic enough to apply to other datasets as well (and thus to Googlers who end up here looking for something similar), so let's go ahead and get this solved.

My suggestion is we edit this answer until you get reasonably satisfactory results.

Number of clusters

Step 1 of any clustering problem: how many clusters to choose? There are a few methods I know of with which you can select the proper number of clusters. There is a nice wiki page about this, containing all of the methods below (and a few more).

Visual inspection

It might seem silly, but if you have well-separated data, a simple plot can tell you already (approximately) how many clusters you'll need, just by looking.

Pros:

  • quick
  • simple
  • works well on well-separated clusters in relatively small datasets

Cons:

  • quick and dirty
  • requires user interaction
  • it's easy to miss smaller clusters
  • data with less-well-separated clusters, or very many of them, is hard to handle with this method
  • it is all rather subjective -- the next person might select a different number than you did.

silhouettes plot

As indicated in one of your other questions, making a silhouettes plot will help you make a better decision about the proper number of clusters in your data (a short sketch follows the pros and cons below).

Pros:

  • relatively simple
  • reduces subjectivity by using statistical measures
  • intuitive way to represent quality of the choice

Cons:

  • requires user interaction
  • In the limit, if you take as many clusters as there are datapoints, a silhouettes plot will tell you that that is the best choice
  • it is still rather subjective, not based on statistical means
  • can be computationally expensive
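
A minimal sketch of using the mean silhouette value to compare candidate cluster counts (assuming data is the numeric sample from your script; the candidate range is a guess):

    candidateK = 2:10;                               % candidate cluster counts -- a guess
    meanSilh = zeros(size(candidateK));
    for i = 1:numel(candidateK)
        idx = kmeans(data, candidateK(i), 'Replicates',3, 'EmptyAction','singleton');
        meanSilh(i) = mean(silhouette(data, idx));   % higher is better
    end
    [~, best] = max(meanSilh);
    bestK = candidateK(best)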

elbow method

As with the silhouettes plot approach, you run kmeans repeatedly, each time with a larger number of clusters, and you see how much of the total variance in the data is explained by the clusters chosen in that kmeans run. There will be a number of clusters where the amount of explained variance suddenly increases a lot less than in any of the previous choices of the number of clusters (the "elbow"). Statistically speaking, the elbow is then the best choice for the number of clusters (a short sketch follows the pros and cons below).

Pros:

  • no user interaction required -- the elbow can be selected automatically
  • statistically more sound than any of the aforementioned methods

Cons:

  • somewhat complicated
  • still subjective, since the definition of the "elbow" depends on subjectively chosen parameters
  • can be computationally expensive
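
A minimal sketch of the elbow computation (again assuming data is the numeric sample from your script; maxK is a guess):

    maxK = 15;                                                % upper bound on K -- a guess
    totalSS = sum(sum(bsxfun(@minus, data, mean(data)).^2));  % total sum of squares
    explained = zeros(maxK, 1);
    for k = 1:maxK
        [~, ~, sumd] = kmeans(data, k, 'Replicates',3, 'EmptyAction','singleton');
        explained(k) = 1 - sum(sumd)/totalSS;                 % fraction of variance explained
    end
    figure, plot(1:maxK, explained, '-o')
    xlabel('number of clusters K'), ylabel('variance explained')
    % the "elbow" is the K after which the curve flattens out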

Outliers

Once you have chosen the number of clusters with any of the methods above, it is time to do outlier detection to see if the quality of your clusters improves.

I would start with a two-step iterative approach, using the elbow method. In pseudo-Matlab:

data = your initial dataset
dataMod = your initial dataset

MAX = the number of clusters chosen by visual inspection

while (forever)

    for N = MAX-5 : MAX+5
        if (N < 1), continue, end
        perform k-means with N clusters on dataMod
        if (variance explained shows a jump)
            break
    end

    if (you are satisfied)
        break
    end

    for i = 1:N
        extract all points from cluster i 
        find the centroid (let k-means do that)
        calculate the standard deviation of distances to the centroid
        mark points further than 3 sigma as possible outliers
    end

    dataMod = data with marked points removed

end
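
A hedged, concrete MATLAB sketch of the per-cluster outlier step inside that loop (interpreting "further than 3 sigma" as more than three standard deviations above the mean distance to the centroid; adjust as you see fit):

    [clustIDX, centroids] = kmeans(dataMod, N, 'Replicates',3, 'EmptyAction','singleton');
    isOutlier = false(size(dataMod,1), 1);
    for i = 1:N
        members = find(clustIDX == i);
        % Euclidean distance of every cluster member to its centroid
        d = sqrt(sum(bsxfun(@minus, dataMod(members,:), centroids(i,:)).^2, 2));
        isOutlier(members) = d > mean(d) + 3*std(d);
    end
    outlierData = dataMod(isOutlier, :);     % keep aside for later classification
    dataMod     = dataMod(~isOutlier, :);    % cleaned data for the next pass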

The tough part is obviously determining whether you are satisfied. This is the key to the algorithm's effectiveness. The rough structure of this part

if (you are satisfied)
    break
end

would be something like this

if (situation has improved)
    data = dataMod

elseif (situation is same or worse)
    dataMod = data
    break            
end

The situation has improved when there are fewer outliers, or when the variance explained for ALL choices of N is better than during the previous pass through the while loop. This is also something to fiddle with.

Anyway, I wouldn't call this much more than a first attempt. If anyone sees incompleteness, flaws or loopholes here, please comment or edit.

Upvotes: 9

Has QUIT--Anony-Mousse

Reputation: 77474

Note that using this dataset is discouraged:

That dataset has errors: KDD Cup '99 dataset (Network Intrusion) considered harmful

Consider using a different algorithm. k-means is not really appropriate for mixed-type data, where many attributes are discrete and have very different scales. K-means needs to be able to compute sensible means, and for a binary vector "0.5" is not a sensible mean; it should be either 0 or 1.

Plus, k-means doesn't like outliers too much.

When plotting, make sure to scale the axes equally, or the result will look distorted. Your x-axis spans roughly 0.9, your y-axis only 0.2 -- no wonder the clusters look squashed.
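
A minimal fix for the squashed look, applied after the scatter3 calls in the question's plotting section (this only changes the display, not the clustering):

    axis equal   % use the same data units on every axis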

Overall, maybe the data set just doesn't have k-means-style clusters? You should definitely try a density-based method (these can deal with outliers), such as DBSCAN. But judging from the visualizations you added, I'd say it has at most 4-5 clusters, and they are not really interesting. They can probably be captured with a number of thresholds in some dimensions.
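
A minimal, hedged sketch of DBSCAN on the normalized sample from the question, assuming a MATLAB release that ships the dbscan function (R2019a and later); epsilon and minpts are guesses that need tuning:

    epsilon = 0.05;   % neighborhood radius -- a guess, tune for your normalized data
    minpts  = 10;     % minimum neighbors for a core point -- also a guess
    labels  = dbscan(data, epsilon, minpts);
    % DBSCAN labels noise points as -1, so outliers drop out of the clustering
    % naturally and can be kept aside for later classification
    outlierData   = data(labels == -1, :);
    clusteredData = data(labels ~= -1, :);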

Parallel coordinates

Here is a visualization of the data set after z-normalization, visualized in parallel coordinates, with 5000 samples. Bright green is normal.

You can clearly see special properties of the data set. All of the attacks are clearly different in attributes 3 and 4 (count and srv_count), and most are also very concentrated in dst_host_count and dst_host_srv_count.
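
If you want to reproduce a roughly comparable parallel-coordinates view in MATLAB, the Statistics Toolbox has a parallelcoords function. A hedged sketch, reusing fulldata from the question, whose last column is the class label:

    idx = randperm(size(fulldata,1));
    idx = idx(1:5000);                           % 5000 random samples
    grp = fulldata(idx, end);                    % class label (last column of fulldata)
    % 'Standardize','on' z-normalizes each attribute; constant columns may need removing first
    parallelcoords(fulldata(idx, 1:end-1), 'Group', grp, 'Standardize', 'on')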

I've run OPTICS on this data set, too. It found a number of clusters, most of them within the wine-colored attack pattern. But they're not really interesting. If you have 10 different hosts ping-flooding, they will form 10 clusters.

OPTICS Clusters

You can see very well that OPTICS managed to cluster a number of these attacks. It missed all the orange stuff (maybe it would have caught it with a lower minpts; it is quite spread out), but it even discovered structure *within* the wine-colored attack, breaking it into a number of separate events.

To really make sense of this data set, you should start with feature extraction, for example by merging such ping flood connection attempts into an aggregate event.

Also note that this is an unrealistic scenario.

  1. There are well-known patterns involved in attacks, in particular port scans. These are best detected with a specialized port-scan detector, not with learning.
  2. The simulated data contains a lot of completely pointless "attacks". For example, the Smurf attack from the 90s makes up >50% of the data set and Syn flood another 20%, while normal traffic is <20%!
  3. For these kinds of attacks, there are well-known signatures.
  4. Many modern attacks (SQL injection, for example) flow with normal HTTP traffic and will not show up as anomalous in raw traffic patterns.

Just don't use this data for classification or outlier detection. Just don't.

Quoting the KDNuggets link above:

As a result, we strongly recommend that

(1) all researchers stop using the KDD Cup '99 dataset,

(2) The KDD Cup and UCI websites include a warning on the KDD Cup '99 dataset webpage informing researchers that there are known problems with the dataset, and

(3) peer reviewers for conferences and journals ding papers (or even outright reject them, as is common in the network security community) with results drawn solely from the KDD Cup '99 dataset.

This is neither real nor realistic data. Go get something else.

Upvotes: 10
