Reputation: 1
What exactly does the DBSCAN algorithm take as input?
Why do I get different output in Weka
than in a hand-coded implementation?
The coded implementation takes only 2 inputs, while Weka
can take 3.
Can someone help me understand the algorithm, please?
Upvotes: 0
Views: 1957
Reputation: 77454
With "2 inputs", do you mean two variables (dimensions), by chance?
If your code only works with 2 dimensions, read up on distance functions. Most distance functions can be computed for more than two dimensions easily... for example, Euclidean distance is defined as
sqrt(pow(x_i-y_i, 2).sum())
which works just as well when i runs from 1 to n with n > 2.
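To make that concrete, here is a minimal Python sketch of a Euclidean distance that does not care how many dimensions the points have. The function name `euclidean` is just my choice for illustration, not something taken from Weka or from your code.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two points with the same number of dimensions."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    # Sum the squared differences over every dimension, then take the root.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean([0, 0], [3, 4]))        # two dimensions
print(euclidean([0, 0, 0], [1, 2, 2]))  # three dimensions, same code
```

The loop over `zip(x, y)` is the "sum over i" from the formula, so nothing has to change when you go from 2 to n attributes.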
DBSCAN has two obvious parameters and one hidden one: minPts and epsilon are the obvious ones, and the hidden parameter is the distance function, which has by far the largest effect on the results and requires an understanding of the data to choose. Unfortunately, there is no rule of thumb for choosing it. It really depends on your data.
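As a rough illustration of where the three parameters enter, here is a small textbook-style DBSCAN sketch in Python. It is not the Weka or ELKI code: the names `dbscan`, `min_pts`, `distance` and the sample data are my own assumptions, and it uses a plain linear scan instead of an index.

```python
import math

NOISE = -1

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dbscan(points, eps, min_pts, distance=euclidean):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # Linear scan over all points; the result includes point i itself.
        return [j for j in range(len(points))
                if distance(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nearby = neighbors(j)
            if len(nearby) >= min_pts:
                queue.extend(nearby)  # j is a core point, keep growing
    return labels

# Hypothetical sample data: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))
```

Swapping in another function for `distance` changes which points count as neighbors, which is why that "hidden" parameter matters so much.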
I'm not surprised if you get different results in the Weka implementation. It contains implicit data normalization, which tends to produce unexpected results... The best implementation of DBSCAN can IMHO be found in ELKI. If you enable data indexes, it is really fast.
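To see why normalization alone can change the outcome, here is a small Python sketch. I am assuming a simple min-max rescaling of every attribute to [0, 1] purely for illustration; I am not claiming this is the exact scheme Weka applies, and the data and helper names are made up.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def min_max_normalize(points):
    """Rescale every attribute to [0, 1] (assumed scheme, for illustration only)."""
    dims = list(zip(*points))
    lows = [min(d) for d in dims]
    highs = [max(d) for d in dims]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(p, lows, highs))
        for p in points
    ]

def eps_neighbors(points, i, eps):
    # Indices of all points within eps of point i (including i itself).
    return [j for j in range(len(points))
            if euclidean(points[i], points[j]) <= eps]

# Hypothetical data: the first attribute spans 0..1000, the second 0..1.
points = [(0, 0.0), (1000, 0.0), (0, 1.0), (1000, 1.0)]
raw = eps_neighbors(points, 0, eps=1.0)
scaled = eps_neighbors(min_max_normalize(points), 0, eps=1.0)
print(raw)     # neighborhood of point 0 on the raw data
print(scaled)  # neighborhood of point 0 after rescaling
```

With the same epsilon, point 0 has a different neighborhood before and after rescaling, so the same minPts and epsilon can yield different clusters.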
Upvotes: 0
Reputation: 11454
The algorithm is described pretty well on Wikipedia. The configuration input is:
eps: Maximum distance for the epsilon neighborhood.
minPts: The number of points which are required to form a region.
Briefly: A new cluster is created if the epsilon neighborhood around a data point contains at least minPts.
Further input:
Upvotes: 1