Reputation: 6080
Hi, I'm trying to cluster network data from the 1999 DARPA data set. Unfortunately I'm not getting well-clustered results, at least not compared to some of the literature that uses the same techniques and methods.
My data comes out like this:
As you can see, it is not very clustered. This is due to a lot of outliers (noise) in the dataset. I have looked at some outlier removal techniques, but nothing I have tried so far really cleans the data. One of the methods I have tried:
%% When an outlier is considered to be more than three standard deviations away from the mean, determine the number of outliers in each column of the count matrix:
mu = mean(data)
sigma = std(data)
[n,p] = size(data);
% Create a matrix of mean values by replicating the mu vector for n rows
MeanMat = repmat(mu,n,1);
% Create a matrix of standard deviation values by replicating the sigma vector for n rows
SigmaMat = repmat(sigma,n,1);
% Create a matrix of zeros and ones, where ones indicate the location of outliers
outliers = abs(data - MeanMat) > 3*SigmaMat;
% Calculate the number of outliers in each column
nout = sum(outliers)
% To remove an entire row of data containing the outlier
data(any(outliers,2),:) = [];
In the first run it removed 48 rows from the 1000 normalized rows that were randomly selected from the full dataset.
This is the full script I used on the data:
%% load data
%# read the list of features
fid = fopen('kddcup.names','rt');
C = textscan(fid, '%s %s', 'Delimiter',':', 'HeaderLines',1);
fclose(fid);
%# determine type of features
C{2} = regexprep(C{2}, '.$',''); %# remove "." at the end
attribNom = [ismember(C{2},'symbolic');true]; %# nominal features
%# build format string used to read/parse the actual data
frmt = cell(1,numel(C{1}));
frmt( ismember(C{2},'continuous') ) = {'%f'}; %# numeric features: read as number
frmt( ismember(C{2},'symbolic') ) = {'%s'}; %# nominal features: read as string
frmt = [frmt{:}];
frmt = [frmt '%s']; %# add the class attribute
%# read dataset
fid = fopen('kddcup.data_10_percent_corrected','rt');
C = textscan(fid, frmt, 'Delimiter',',');
fclose(fid);
%# convert nominal attributes to numeric
ind = find(attribNom);
G = cell(numel(ind),1);
for i=1:numel(ind)
[C{ind(i)},G{i}] = grp2idx( C{ind(i)} );
end
%# all numeric dataset
fulldata = cell2mat(C);
%% dimensionality reduction
columns = 6
[U,S,V]=svds(fulldata,columns);
%% randomly select dataset
rows = 1000;
columns = 6;
%# pick random rows
indX = randperm( size(fulldata,1) );
indX = indX(1:rows)';
%# pick random columns (indY has to be defined before it is indexed)
indY = randperm( size(U,2) );
indY = indY(1:columns);
%# filter data
data = U(indX,indY);
% apply normalization method to every cell
maxData = max(max(data));
minData = min(min(data));
data = (data - minData) ./ (maxData - minData);   % min-max scaling to [0,1]
% output matching data
dataSample = fulldata(indX, :)
%% When an outlier is considered to be more than three standard deviations away from the mean, use the following syntax to determine the number of outliers in each column of the count matrix:
mu = mean(data)
sigma = std(data)
[n,p] = size(data);
% Create a matrix of mean values by replicating the mu vector for n rows
MeanMat = repmat(mu,n,1);
% Create a matrix of standard deviation values by replicating the sigma vector for n rows
SigmaMat = repmat(sigma,n,1);
% Create a matrix of zeros and ones, where ones indicate the location of outliers
outliers = abs(data - MeanMat) > 2.5*SigmaMat;
% Calculate the number of outliers in each column
nout = sum(outliers)
% To remove an entire row of data containing the outlier
data(any(outliers,2),:) = [];
%% generate sample data
K = 6;
numObservations = size(data, 1);
dimensions = 3;
%% cluster
opts = statset('MaxIter', 100, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);
%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 5, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 100, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
grid on
view([90 0]);
%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);
These are two distinct clusters from the output:
As you can see, the data looks cleaner and more clustered than the original. However, I still think a better method could be used.
For instance, looking at the overall clustering, I still have a lot of noise (outliers) from the dataset, as can be seen here:
I need the outlier rows put into a separate dataset for later classification (removed only from the clustering).
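(For what it's worth, a small change to the 3-sigma snippet above would do that: copy the flagged rows out before deleting them. The variable names below are just placeholders of mine.)
outlierRows = any(outliers, 2);           % rows flagged by the 3-sigma test
outlierData = data(outlierRows, :);       % set them aside for later classification
outlierOrig = dataSample(outlierRows, :); % matching un-normalized rows
data(outlierRows, :) = [];                % remove them from the clustering input only
dataSample(outlierRows, :) = [];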
Here is a link to the DARPA dataset. Please note that the 10% data set has been significantly reduced in columns: the majority of columns, which have only 0s or 1s running throughout, have been removed (from 42 columns down to 6):
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
EDIT
Columns kept in the dataset are:
src_bytes: continuous.
dst_bytes: continuous.
count: continuous.
srv_count: continuous.
dst_host_count: continuous.
dst_host_srv_count: continuous.
RE-EDIT:
Based on discussions with Anony-Mousse and his answer, there may be a way of reducing noise in the clustering by using K-medoids (http://en.wikipedia.org/wiki/K-medoids). I'm hoping that it won't require much change to the code I currently have, but as of yet I do not know how to implement it to test whether it will significantly reduce the noise. So provided that someone can show me a working example, it will be accepted as an answer.
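For reference, a minimal sketch of what that swap might look like, assuming the kmedoids function from the Statistics and Machine Learning Toolbox (R2014b or later) is available; only the clustering call changes:
%% cluster with k-medoids instead of k-means (requires kmedoids, R2014b+)
K = 6;
opts = statset('MaxIter', 100, 'Display', 'iter');
[clustIDX, medoids] = kmedoids(data, K, ...
    'Distance', 'euclidean', 'Options', opts, 'Replicates', 3);
% the plotting and silhouette code from the script above can stay the same,
% with "clusters" replaced by "medoids"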
Upvotes: 8
Views: 6362
Reputation: 38032
First things first: you're asking for a lot here. For future reference: try to break up your problem in smaller chunks, and post several questions. This increases your chances of getting answers (and doesn't cost you 400 reputation!).
Luckily for you, I understand your predicament, and just love this sort of question!
Apart from this dataset's possible issues with k-means, this question is still generic enough to apply also to other datasets (and thus Googlers ending up here looking for a similar thing), so let's go ahead and get this solved.
My suggestion is we edit this answer until you get reasonably satisfactory results.
Step 1 of any clustering problem: how many clusters to choose? There are a few methods I know of with which you can select the proper number of clusters. There is a nice wiki page about this, containing all of the methods below (and a few more).
It might seem silly, but if you have well-separated data, a simple plot can tell you already (approximately) how many clusters you'll need, just by looking.
Pros:
Cons:
As indicated in one of your other questions, making a silhouette plot will help you make a better decision about the proper number of clusters in your data (a small sketch of how to automate this follows below).
Pros:
Cons:
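A rough sketch of how that choice could be automated, reusing the kmeans/silhouette calls from your script (maxK = 10 is an arbitrary upper bound of mine):
maxK = 10;
avgSilh = zeros(maxK, 1);
for k = 2:maxK
    idx = kmeans(data, k, 'Distance', 'sqEuclidean', ...
                 'EmptyAction', 'singleton', 'Replicates', 3);
    s = silhouette(data, idx);
    avgSilh(k) = mean(s);        % mean silhouette width for this k
end
[~, bestK] = max(avgSilh(2:maxK));
bestK = bestK + 1;               % offset because the loop started at k = 2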
As with the silhouette plot approach, you run kmeans repeatedly, each time with a larger number of clusters, and you see how much of the total variance in the data is explained by the clusters chosen by this kmeans run. There will be a number of clusters at which the amount of explained variance suddenly increases a lot less than in any of the previous choices of the number of clusters (the "elbow"). The elbow is then, statistically speaking, the best choice for the number of clusters (again, a sketch follows below).
Pros:
Cons:
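And a rough sketch of how the numbers for an elbow plot can be computed; it reuses kmeans' third output (the within-cluster sums of squared distances), and nothing here is specific to your data:
maxK  = 15;
totSS = sum(sum(bsxfun(@minus, data, mean(data)).^2));   % total sum of squares
explained = zeros(maxK, 1);
for k = 1:maxK
    [~, ~, sumd] = kmeans(data, k, 'Distance', 'sqEuclidean', ...
                          'EmptyAction', 'singleton', 'Replicates', 3);
    explained(k) = 1 - sum(sumd) / totSS;   % fraction of variance explained
end
figure, plot(1:maxK, explained, '-o')
xlabel('number of clusters'), ylabel('variance explained')
% look for the "elbow": the k after which the curve flattens out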
Once you have chosen the number of clusters with any of the methods above, it is time to do outlier detection to see if the quality of your clusters improves.
I would start with a two-step iterative approach, using the elbow method. In pseudo-Matlab:
data = your initial dataset
dataMod = your initial dataset
MAX = the number of clusters chosen by visual inspection
while (forever)
    for N = MAX-5 : MAX+5
        if (N < 1), continue, end
        perform k-means with N clusters on dataMod
        if (variance explained shows a jump)
            break
        end
    end
    if (you are satisfied)
        break
    end
    for i = 1:N
        extract all points from cluster i
        find the centroid (let k-means do that)
        calculate the standard deviation of distances to the centroid
        mark points further than 3 sigma as possible outliers
    end
    dataMod = data with marked points removed
end
The tough part is obviously determining whether you are satisfied. This is the key to the algorithm's effectiveness. The rough structure of this part
if (you are satisfied)
break
end
would be something like this:
if (situation has improved)
data = dataMod
elseif (situation is same or worse)
dataMod = data
break
end
The situation has improved when there are fewer outliers, or when the variance explained for ALL choices of N is better than during the previous iteration of the while loop. This is also something to fiddle with.
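For concreteness, here is a rough MATLAB rendering of the marking step in the pseudocode above. The stopping criterion is replaced by a simple pass cap, because the "satisfied" check is exactly the part you will have to tune yourself:
N = 6;                          % number of clusters chosen by one of the methods above
dataMod = data;
removedRows = [];               % outliers collected here, for later inspection
for pass = 1:5                  % crude stand-in for the "until satisfied" loop
    [idx, centroids] = kmeans(dataMod, N, 'Distance', 'sqEuclidean', ...
                              'EmptyAction', 'singleton', 'Replicates', 3);
    isOut = false(size(dataMod, 1), 1);
    for i = 1:N
        members = find(idx == i);
        % distances of the cluster members to their own centroid
        d = sqrt(sum(bsxfun(@minus, dataMod(members,:), centroids(i,:)).^2, 2));
        % mark members further than mean + 3 sigma of these distances
        isOut(members) = d > mean(d) + 3*std(d);
    end
    if ~any(isOut), break; end                       % nothing left to remove
    removedRows = [removedRows; dataMod(isOut, :)];  %#ok<AGROW>
    dataMod(isOut, :) = [];
end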
Anyway, I wouldn't call this much more than a first attempt. If anyone sees omissions, flaws or loopholes here, please comment or edit.
Upvotes: 9
Reputation: 77474
Note that using this dataset is discouraged:
That dataset has errors: see "KDD Cup '99 dataset (Network Intrusion) considered harmful".
Consider using a different algorithm. k-means is not really appropriate for mixed-type data, where many attributes are discrete and have very different scales. k-means needs to be able to compute sensible means, and for a binary vector "0.5" is not a sensible mean; it should be either 0 or 1.
Plus, k-means doesn't like outliers too much.
When plotting, make sure to scale the axes equally, or the result will look incorrect. Your x-axis has a length of around 0.9, your y-axis only 0.2; no wonder the clusters look squashed.
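With the MATLAB plotting code from the question, that is a single extra line after the scatter3 calls, for example:
axis equal   % same data units along every axis, so the clusters are not visually squashed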
Overall, maybe the data set just doesn't have k-means-style clusters? You definitely should try density-based methods (because these can deal with outliers), such as DBSCAN. But judging from the visualizations you added, I'd say it has at most 4-5 clusters, and they are not really interesting. They probably can be captured with a number of thresholds in some dimensions.
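If you want to try that in MATLAB: recent versions of the Statistics and Machine Learning Toolbox (R2019a and later) ship a dbscan function. The epsilon and minpts values below are only placeholders that would need tuning for this data:
epsilon = 0.05;                   % neighbourhood radius -- placeholder, tune it
minpts  = 10;                     % minimum neighbours for a core point -- placeholder
idx = dbscan(data, epsilon, minpts);
noiseRows  = (idx == -1);         % DBSCAN labels noise points with -1
outlierSet = data(noiseRows, :);  % exactly the rows you wanted to set aside
clusterSet = data(~noiseRows, :);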
Here is a visualization of the data set after z-normalization, visualized in parallel coordinates, with 5000 samples. Bright green is normal.
You can clearly see special properties of the data set. All of the attacks are clearly different in attributes 3 and 4 (count and srv_count), and most are also heavily concentrated in dst_host_count and dst_host_srv_count.
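A similar view can be produced in MATLAB with parallelcoords from the Statistics Toolbox; a rough sketch, where the column indices are my assumption of where the six kept attributes sit in the standard 41-feature kddcup ordering (verify them against kddcup.names):
cols   = [5 6 23 24 32 33];      % src_bytes, dst_bytes, count, srv_count, dst_host_count, dst_host_srv_count (assumed)
X      = dataSample(:, cols);
labels = dataSample(:, end);     % numeric class label (last column of fulldata)
figure
parallelcoords(zscore(X), 'Group', labels);   % z-normalize, one line per connection, coloured by class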
I've run OPTICS on this data set, too. It found a number of clusters, most of them in the wine-colored attack pattern. But they're not really interesting. If you have 10 different hosts ping-flooding, they will form 10 clusters.
You can see very well that OPTICS managed to cluster a number of these attacks. It missed all the orange stuff (it is quite spread out; maybe a lower minpts would have caught it), but it even discovered structure *within* the wine-colored attack, breaking it into a number of separate events.
To really make sense of this data set, you should start with feature extraction, for example by merging such ping flood connection attempts into an aggregate event.
Also note that this is an unrealistic scenario.
Just don't use this data for classification or outlier detection. Just don't.
Quoting the KDNuggets link above:
As a result, we strongly recommend that
(1) all researchers stop using the KDD Cup '99 dataset,
(2) The KDD Cup and UCI websites include a warning on the KDD Cup '99 dataset webpage informing researchers that there are known problems with the dataset, and
(3) peer reviewers for conferences and journals ding papers (or even outright reject them, as is common in the network security community) with results drawn solely from the KDD Cup '99 dataset.
This is neither real nor realistic data. Go get something else.
Upvotes: 10