user2879969
user2879969

Reputation: 51

Parameter eps of DBSCAN, python

I have a set of points . Their geometry (SRID: 4326) is stored in a Database. I have been given a code that aims to cluster this points with DBSCAN. The parameters have been set as follow: eps=1000, min_points=1.

I obtain clusters that are less distant than 1000 meters. I believed that two points less distant than 1000 meters would belong to the same cluster. Is epsilon really in meters?

The code is the following:

    self.algorithm='DBSCAN'
    X=self.data[:,[2,3]]
    if self.debug==True:
        print 'Nbr of Points: %d'% len(X)
    # print X.shape
    # print dist_matrix.shape
    D = distance.squareform(distance.pdist(X,'euclidean'))
    # print dist_matrix
    # S = 1 - (D / np.max(D))
    db = DBSCAN(eps, min_samples).fit(D)
    self.core_samples = db.core_sample_indices_
    self.labels = db.labels

the aim is not to find another way to run it but really to understand the value of eps. What it represents in term of distance. Min_sample is set to one because I accept to have clusters with a size of 1 sample indeed.

Upvotes: 3

Views: 2454

Answers (1)

Has QUIT--Anony-Mousse
Has QUIT--Anony-Mousse

Reputation: 77495

This depends on your implementation.

Your distance function could return anything; including meters, millimeters, yards, km, miles, degrees... but you did not share what function you use for computing distance! If I'm not mistaken, SRID: 4326 does not imply anything on distance computations.

The "haversine" used by sklearn seems to use degrees, not meters.

Either way, min_points=1 is nonsensical. The query point is included, so every point itself is a cluster. With min_points <= 2, the result of DBSCAN will be a single-linkage clustering. To get a density based clustering, you need to choose a higher value to get real density.

You may want to use ELKI's DBSCAN. According to their Java sources, their distance function uses meters, but also their R*-tree index allows accelerated range queries with this distance, which will yield a substantial speed-up (O(n log n) instead of O(n^2)).

Upvotes: 3

Related Questions