G Gr

Reputation: 6080

Irregular plot of k-means clustering, outlier removal

Hi, I'm working on clustering network data from the 1999 DARPA data set. Unfortunately I'm not really getting clustered data compared to some of the literature, despite using the same techniques and methods.

My data comes out like this:

Matlab Figure 1

As you can see, it is not very clustered. This is due to a lot of outliers (noise) in the dataset. I have looked at some outlier removal techniques, but nothing I have tried so far really cleans the data. One of the methods I have tried:

%% When an outlier is considered to be more than three standard deviations away from the mean, determine the number of outliers in each column of the count matrix:

    mu = mean(data)
    sigma = std(data)
    [n,p] = size(data);
    % Create a matrix of mean values by replicating the mu vector for n rows
    MeanMat = repmat(mu,n,1);
    % Create a matrix of standard deviation values by replicating the sigma vector for n rows
    SigmaMat = repmat(sigma,n,1);
    % Create a matrix of zeros and ones, where ones indicate the location of outliers
    outliers = abs(data - MeanMat) > 3*SigmaMat;
    % Calculate the number of outliers in each column
    nout = sum(outliers) 
    % To remove an entire row of data containing the outlier
    data(any(outliers,2),:) = [];

In the first run it removed 48 rows from the 1000 normalized random rows that were selected from the full dataset.

This is the full script I used on the data:

    %% load data
        %# read the list of features
        fid = fopen('kddcup.names','rt');
        C = textscan(fid, '%s %s', 'Delimiter',':', 'HeaderLines',1);
        fclose(fid);

        %# determine type of features
        C{2} = regexprep(C{2}, '.$','');              %# remove "." at the end
        attribNom = [ismember(C{2},'symbolic');true]; %# nominal features

        %# build format string used to read/parse the actual data
        frmt = cell(1,numel(C{1}));
        frmt( ismember(C{2},'continuous') ) = {'%f'}; %# numeric features: read as number
        frmt( ismember(C{2},'symbolic') ) = {'%s'};   %# nominal features: read as string
        frmt = [frmt{:}];
        frmt = [frmt '%s'];                           %# add the class attribute

        %# read dataset
        fid = fopen('kddcup.data_10_percent_corrected','rt');
        C = textscan(fid, frmt, 'Delimiter',',');
        fclose(fid);

        %# convert nominal attributes to numeric
        ind = find(attribNom);
        G = cell(numel(ind),1);
        for i=1:numel(ind)
            [C{ind(i)},G{i}] = grp2idx( C{ind(i)} );
        end

        %# all numeric dataset
        fulldata = cell2mat(C);

%% dimensionality reduction 
columns = 6
[U,S,V]=svds(fulldata,columns);

%% randomly select dataset
rows = 1000;
columns = 6;

%# pick random rows
indX = randperm( size(fulldata,1) );
indX = indX(1:rows)';

%# pick random columns (indY must be initialised before indexing into it)
indY = randperm( size(U,2) );
indY = indY(1:columns);

%# filter data
data = U(indX,indY);

% apply min-max normalization over the whole matrix
maxData = max(max(data));
minData = min(min(data));
data = (data - minData) ./ (maxData - minData);

% output matching data
dataSample = fulldata(indX, :)

%% When an outlier is considered to be more than 2.5 standard deviations away from the mean, use the following syntax to determine the number of outliers in each column of the count matrix:

mu = mean(data)
sigma = std(data)
[n,p] = size(data);
% Create a matrix of mean values by replicating the mu vector for n rows
MeanMat = repmat(mu,n,1);
% Create a matrix of standard deviation values by replicating the sigma vector for n rows
SigmaMat = repmat(sigma,n,1);
% Create a matrix of zeros and ones, where ones indicate the location of outliers
outliers = abs(data - MeanMat) > 2.5*SigmaMat;
% Calculate the number of outliers in each column
nout = sum(outliers) 
% To remove an entire row of data containing the outlier
data(any(outliers,2),:) = [];

%% generate sample data
K = 6;
numObservations = size(data, 1);
dimensions = 3;

%% cluster
opts = statset('MaxIter', 100, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);

%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 5, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 100, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
grid on
view([90 0]);

%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);

These are two distinct clusters from the output:

[figure: two of the resulting clusters]

As you can see, the data looks cleaner and more clustered than the original. However, I still think a better method can be used.

For instance, looking at the overall clustering, there is still a lot of noise (outliers) in the dataset, as can be seen here:

[figure: overall clustering, outliers still visible]

I need the outlier rows put into a separate dataset for later classification (removed only from the clustering).
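
With the variables already in the script above, this can be done by saving the flagged rows before deleting them. A minimal sketch:

    flagged = any(outliers, 2);                    % rows containing at least one outlier value
    outlierData   = data(flagged, :);              % set aside for later classification
    outlierSample = dataSample(flagged, :);        % matching rows of the un-normalized sample
    data(flagged, :) = [];                         % remove them from the clustering input only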

Here is a link to the DARPA dataset. Please note that the 10% data set has had a significant reduction in columns: most of the columns, which have only 0s or 1s running throughout, have been removed (from 42 columns down to 6):

http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

EDIT

Columns kept in the dataset are:

src_bytes: continuous.
dst_bytes: continuous.
count: continuous.
srv_count: continuous.
dst_host_count: continuous.
dst_host_srv_count: continuous.

RE-EDIT:

Based on discussions with Anony-Mousse and his answer, there may be a way of reducing noise in the clustering by using K-medoids (http://en.wikipedia.org/wiki/K-medoids). I'm hoping there isn't much of a change to the code I currently have, but as of yet I do not know how to implement it to test whether this will significantly reduce the noise. So, provided someone can show me a working example, this will be accepted as an answer.
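
A minimal, untested sketch of what the swap might look like, assuming a MATLAB release whose Statistics Toolbox ships the kmedoids function (R2014b and later), which has a kmeans-like interface:

    opts = statset('MaxIter', 100, 'Display', 'iter');
    % medoids are actual data points, so they should be less sensitive to outliers
    % than the k-means centroids used in the script above
    [clustIDX, medoids, sumd] = kmedoids(data, K, 'Distance', 'euclidean', ...
        'Replicates', 3, 'Options', opts);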

Upvotes: 8

Views: 6362

Answers (2)

Rody Oldenhuis

Reputation: 38032

First things first: you're asking for a lot here. For future reference: try to break your problem up into smaller chunks and post several questions. This increases your chances of getting answers (and doesn't cost you 400 reputation!).

Luckily for you, I understand your predicament, and just love this sort of question!

Apart from this dataset's possible issues with k-means, this question is still generic enough to apply to other datasets as well (and thus to Googlers who end up here looking for something similar), so let's go ahead and get this solved.

My suggestion is we edit this answer until you get reasonably satisfactory results.

Number of clusters

Step 1 of any clustering problem: how many clusters to choose? There are a few methods I know of with which you can select the proper number of clusters. There is a nice wiki page about this, containing all of the methods below (and a few more).

Visual inspection

It might seem silly, but if you have well-separated data, a simple plot can tell you already (approximately) how many clusters you'll need, just by looking.

Pros:

  • quick
  • simple
  • works well on well-separated clusters in relatively small datasets

Cons:

  • quick and dirty
  • requires user interaction
  • it's easy to miss smaller clusters
  • data with less-well-separated clusters, or very many of them, is hard to handle with this method
  • it is all rather subjective -- the next person might select a different number than you did.

silhouettes plot

As indicated in one of your other questions, making a silhouettes plot will help you make a better decision about the proper number of clusters in your data (a short sketch follows the pros and cons below).

Pros:

  • relatively simple
  • reduces subjectivity by using statistical measures
  • intuitive way to represent quality of the choice

Cons:

  • requires user interaction
  • In the limit, if you take as many clusters as there are datapoints, a silhouettes plot will tell you that that is the best choice
  • it is still rather subjective, not based on statistical means
  • can be computationally expensive
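
A minimal sketch of using the mean silhouette value to compare candidate cluster counts (assuming data is the numeric sample from your script; the candidate range is a guess):

    candidateK = 2:10;                               % candidate cluster counts -- a guess
    meanSilh = zeros(size(candidateK));
    for i = 1:numel(candidateK)
        idx = kmeans(data, candidateK(i), 'Replicates',3, 'EmptyAction','singleton');
        meanSilh(i) = mean(silhouette(data, idx));   % higher is better
    end
    [~, best] = max(meanSilh);
    bestK = candidateK(best)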

elbow method

As with the silhouettes plot approach, you run kmeans repeatedly, each time with a larger number of clusters, and you see how much of the total variance in the data is explained by the clusters chosen in that kmeans run. There will be a number of clusters where the amount of explained variance suddenly increases a lot less than in any of the previous choices of the number of clusters (the "elbow"). Statistically speaking, the elbow is then the best choice for the number of clusters (a short sketch follows the pros and cons below).

Pros:

  • no user interaction required -- the elbow can be selected automatically
  • statistically more sound than any of the aforementioned methods

Cons:

  • somewhat complicated
  • still subjective, since the definition of the "elbow" depends on subjectively chosen parameters
  • can be computationally expensive
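
A minimal sketch of the elbow computation (again assuming data is the numeric sample from your script; maxK is a guess):

    maxK = 15;                                                % upper bound on K -- a guess
    totalSS = sum(sum(bsxfun(@minus, data, mean(data)).^2));  % total sum of squares
    explained = zeros(maxK, 1);
    for k = 1:maxK
        [~, ~, sumd] = kmeans(data, k, 'Replicates',3, 'EmptyAction','singleton');
        explained(k) = 1 - sum(sumd)/totalSS;                 % fraction of variance explained
    end
    figure, plot(1:maxK, explained, '-o')
    xlabel('number of clusters K'), ylabel('variance explained')
    % the "elbow" is the K after which the curve flattens out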

Outliers

Once you have chosen the number of clusters with any of the methods above, it is time to do outlier detection to see if the quality of your clusters improves.

I would start with a two-step iterative approach, using the elbow method. In pseudo-Matlab:

data = your initial dataset
dataMod = your initial dataset

MAX = the number of clusters chosen by visual inspection

while (forever)

    for N = MAX-5 : MAX+5
        if (N < 1), continue, end
        perform k-means with N clusters on dataMod
        if (variance explained shows a jump)
            break
    end

    if (you are satisfied)
        break
    end

    for i = 1:N
        extract all points from cluster i 
        find the centroid (let k-means do that)
        calculate the standard deviation of distances to the centroid
        mark points further than 3 sigma as possible outliers
    end

    dataMod = data with marked points removed

end
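
A hedged, concrete MATLAB sketch of the per-cluster outlier step inside that loop (interpreting "further than 3 sigma" as more than three standard deviations above the mean distance to the centroid; adjust as you see fit):

    [clustIDX, centroids] = kmeans(dataMod, N, 'Replicates',3, 'EmptyAction','singleton');
    isOutlier = false(size(dataMod,1), 1);
    for i = 1:N
        members = find(clustIDX == i);
        % Euclidean distance of every cluster member to its centroid
        d = sqrt(sum(bsxfun(@minus, dataMod(members,:), centroids(i,:)).^2, 2));
        isOutlier(members) = d > mean(d) + 3*std(d);
    end
    outlierData = dataMod(isOutlier, :);     % keep aside for later classification
    dataMod     = dataMod(~isOutlier, :);    % cleaned data for the next pass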

The tough part is obviously determining whether you are satisfied. This is the key to the algorithm's effectiveness. The rough structure of this part

if (you are satisfied)
    break
end

would be something like this

if (situation has improved)
    data = dataMod

elseif (situation is same or worse)
    dataMod = data
    break            
end

The situation has improved when there are fewer outliers, or when the variance explained for ALL choices of N is better than during the previous pass through the while loop. This is also something to fiddle with.

Anyway, I wouldn't call this much more than a first attempt. If anyone sees incompleteness, flaws or loopholes here, please comment or edit.

Upvotes: 9

Has QUIT--Anony-Mousse

Reputation: 77474

Note that using this dataset is discouraged:

That dataset has errors: KDD Cup '99 dataset (Network Intrusion) considered harmful

Consider using a different algorithm. k-means is not really appropriate for mixed-type data, where many attributes are discrete and have very different scales. K-means needs to be able to compute sensible means, and for a binary vector "0.5" is not a sensible mean; it should be either 0 or 1.

Plus, k-means doesn't like outliers too much.

When plotting, make sure to scale the axes equally, or the result will look distorted. Your x-axis spans roughly 0.9, your y-axis only 0.2 -- no wonder the clusters look squashed.
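
A minimal fix for the squashed look, applied after the scatter3 calls in the question's plotting section (this only changes the display, not the clustering):

    axis equal   % use the same data units on every axis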

Overall, maybe the data set just doesn't have k-means-style clusters? You should definitely try a density-based method (these can deal with outliers), such as DBSCAN. But judging from the visualizations you added, I'd say it has at most 4-5 clusters, and they are not really interesting. They can probably be captured with a number of thresholds in some dimensions.
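
A minimal, hedged sketch of DBSCAN on the normalized sample from the question, assuming a MATLAB release that ships the dbscan function (R2019a and later); epsilon and minpts are guesses that need tuning:

    epsilon = 0.05;   % neighborhood radius -- a guess, tune for your normalized data
    minpts  = 10;     % minimum neighbors for a core point -- also a guess
    labels  = dbscan(data, epsilon, minpts);
    % DBSCAN labels noise points as -1, so outliers drop out of the clustering
    % naturally and can be kept aside for later classification
    outlierData   = data(labels == -1, :);
    clusteredData = data(labels ~= -1, :);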

Parallel coordinates

Here is a visualization of the data set after z-normalization, visualized in parallel coordinates, with 5000 samples. Bright green is normal.

You can clearly see special properties of the data set. All of the attacks are clearly different in attributes 3 and 4 (count and srv_count), and most are also very concentrated in dst_host_count and dst_host_srv_count.
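
If you want to reproduce a roughly comparable parallel-coordinates view in MATLAB, the Statistics Toolbox has a parallelcoords function. A hedged sketch, reusing fulldata from the question, whose last column is the class label:

    idx = randperm(size(fulldata,1));
    idx = idx(1:5000);                           % 5000 random samples
    grp = fulldata(idx, end);                    % class label (last column of fulldata)
    % 'Standardize','on' z-normalizes each attribute; constant columns may need removing first
    parallelcoords(fulldata(idx, 1:end-1), 'Group', grp, 'Standardize', 'on')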

I've run OPTICS on this data set, too. It found a number of clusters, most of them within the wine-colored attack pattern. But they're not really interesting. If you have 10 different hosts ping-flooding, they will form 10 clusters.

OPTICS Clusters

You can see very well that OPTICS managed to cluster a number of these attacks. It missed all the orange stuff (maybe it would have caught it with a lower minpts; it is quite spread out), but it even discovered structure *within* the wine-colored attack, breaking it into a number of separate events.

To really make sense of this data set, you should start with feature extraction, for example by merging such ping flood connection attempts into an aggregate event.

Also note that this is an unrealistic scenario.

  1. There are well-known patterns involved in attacks, in particular port scans. These are best detected with a specialized port-scan detector, not with learning.
  2. The simulated data contains a lot of completely pointless "attacks". For example, the Smurf attack from the 90s makes up >50% of the data set and Syn flood another 20%, while normal traffic is <20%!
  3. For these kinds of attacks, there are well-known signatures.
  4. Many modern attacks (SQL injection, for example) flow with normal HTTP traffic and will not show up as anomalous in raw traffic patterns.

Just don't use this data for classification or outlier detection. Just don't.

Quoting the KDNuggets link above:

As a result, we strongly recommend that

(1) all researchers stop using the KDD Cup '99 dataset,

(2) The KDD Cup and UCI websites include a warning on the KDD Cup '99 dataset webpage informing researchers that there are known problems with the dataset, and

(3) peer reviewers for conferences and journals ding papers (or even outright reject them, as is common in the network security community) with results drawn solely from the KDD Cup '99 dataset.

This is neither real nor realistic data. Go get something else.

Upvotes: 10
