Reputation: 5569
I am using Matlab and the code provided at http://www.mathworks.com/matlabcentral/fileexchange/14034-kernel-density-estimator/content/kde.m
to cluster 1D data. In particular I estimate the density function of my data and then analysing the peaks I should be able to identify the different distributions that form my dataset. (correct?) I then cluster the points according to these cluster centroids (peaks in density function).
You can find my data (z) at: https://drive.google.com/file/d/0B3vXKJ_zYaCJLUE3YkVBMmFtbUk/view?usp=sharing
and the plot of the probabiity density function at: https://drive.google.com/file/d/0B3vXKJ_zYaCJTjVobHRBOXo4Tmc/view?usp=sharing
What I did was simply to run
[bandwidth,density,xmesh]=kde(z);
plot(xmesh,density);
What I get (please have a look at the second link) is 1 peak in the density function per data point.... I think that I am doing something wrong... Could the default parameter of the kde function be the cause?
kde(data,n,MIN,MAX)
% data - a vector of data from which the density estimate is constructed;
% n - the number of mesh points used in the uniform discretization of the
% interval [MIN, MAX]; n has to be a power of two; if n is not a power of two, then
% n is rounded up to the next power of two, i.e., n is set to n=2^ceil(log2(n));
% the default value of n is n=2^12;
% MIN, MAX - defines the interval [MIN,MAX] on which the density estimate is constructed;
% the default values of MIN and MAX are:
% MIN=min(data)-Range/10 and MAX=max(data)+Range/10, where Range=max(data)-min(data);
Would this be possible? Could you tell me on which basis should I change them?
Upvotes: 1
Views: 2092
Reputation: 5073
You point out the solution in your question. The documentation suggests that the algorithm sets a ceiling of 2^N peaks created from the data. The default (16k or 2^14) is larger than the number of points you supplied (~8k) resulting in the "spiky" behaviour.
If you instead run
[bandwidth,density,xmesh]=kde(z,2^N);
for different values of 2^N (the function demands a power of 2, must be a FFT thing) you get a plot like follows:
based on which you can pick the value of N that is appropriate.
Upvotes: 1