Reputation: 51
I have a set of points . Their geometry (SRID: 4326) is stored in a Database. I have been given a code that aims to cluster this points with DBSCAN. The parameters have been set as follow: eps=1000, min_points=1.
I obtain clusters that are less distant than 1000 meters. I believed that two points less distant than 1000 meters would belong to the same cluster. Is epsilon really in meters?
The code is the following:
self.algorithm='DBSCAN'
X=self.data[:,[2,3]]
if self.debug==True:
print 'Nbr of Points: %d'% len(X)
# print X.shape
# print dist_matrix.shape
D = distance.squareform(distance.pdist(X,'euclidean'))
# print dist_matrix
# S = 1 - (D / np.max(D))
db = DBSCAN(eps, min_samples).fit(D)
self.core_samples = db.core_sample_indices_
self.labels = db.labels
the aim is not to find another way to run it but really to understand the value of eps. What it represents in term of distance. Min_sample is set to one because I accept to have clusters with a size of 1 sample indeed.
Upvotes: 3
Views: 2454
Reputation: 77495
This depends on your implementation.
Your distance function could return anything; including meters, millimeters, yards, km, miles, degrees... but you did not share what function you use for computing distance!
If I'm not mistaken, SRID: 4326
does not imply anything on distance computations.
The "haversine"
used by sklearn seems to use degrees, not meters.
Either way, min_points=1
is nonsensical. The query point is included, so every point itself is a cluster. With min_points <= 2
, the result of DBSCAN will be a single-linkage clustering. To get a density based clustering, you need to choose a higher value to get real density.
You may want to use ELKI's DBSCAN. According to their Java sources, their distance function uses meters, but also their R*-tree index allows accelerated range queries with this distance, which will yield a substantial speed-up (O(n log n)
instead of O(n^2)
).
Upvotes: 3