user1478842

Reputation: 135

Clustering algorithms with custom distance function in Python

I've got a clustering problem that I believe requires a custom distance function. Each instance has an x, y coordinate, but also a set of attributes that describe it (the number of attributes varies per instance). Ideally it would be possible to pass the algorithm Python objects (instances of a class) and compare them arbitrarily based on their content.

I want to represent the distance as a weighted sum of the Euclidean distance between the x, y values and something like a Jaccard index to measure the set overlap of the other attributes. Something like:

dist = (euclidean(x1, y1, x2, y2) * 0.6) + ((1 - jaccard(attrs1, attrs2)) * 0.4)
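
In code, that could look something like this (a rough sketch, assuming each instance exposes x, y coordinates and a set-valued attrs; combined_dist and its weights are just illustrative):

from math import hypot

def combined_dist(a, b, w_spatial=0.6, w_attrs=0.4):
    # Euclidean distance between the coordinate pairs
    euc = hypot(a.x - b.x, a.y - b.y)
    # Jaccard index of the attribute sets (defined as 1.0 when both are empty)
    union = a.attrs | b.attrs
    jaccard = len(a.attrs & b.attrs) / len(union) if union else 1.0
    # Weighted sum of spatial distance and set dissimilarity
    return euc * w_spatial + (1 - jaccard) * w_attrs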

Most of the clustering algorithms and implementations I've found convert instance features into numbers. For example, with DBSCAN in sklearn, to use my own distance function I would need to convert those numbers back into the original representation somehow.

It would be great if it were possible to do clustering with a distance function that can compare instances in any arbitrary way. For example, imagine a Euclidean distance function that evaluates objects as closer if they match on another, non-spatial feature:

from math import hypot

def dist(ins1, ins2):
    # Plain Euclidean distance between the two coordinate pairs
    euc = hypot(ins1.x - ins2.x, ins1.y - ins2.y)
    # Treat instances as closer if they share a non-spatial feature
    if ins1.feature1 == ins2.feature1:
        euc = euc * 0.9
    return euc

Is there a method that would suit this? It would also be nice if the number of clusters didn't have to be set upfront (but this is not critical for me).

Upvotes: 3

Views: 3506

Answers (1)

Has QUIT--Anony-Mousse

Reputation: 77454

Actually, almost all the clustering algorithms (except for k-means, which needs numbers to compute the mean, obviously) can be used with arbitrary distance functions.

In sklearn, most algorithms accept metric="precomputed" and a distance matrix instead of the original input data. Please check the documentation more carefully. For example, DBSCAN:

If metric is “precomputed”, X is assumed to be a distance matrix and must be square.
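
A minimal sketch of that route, reusing the dist function from the question (the cluster helper and its eps/min_samples defaults are just for illustration):

import numpy as np
from sklearn.cluster import DBSCAN

def cluster(instances, eps=0.5, min_samples=5):
    n = len(instances)
    # Build the full pairwise distance matrix; this is the O(n^2) step.
    matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(instances[i], instances[j])
            matrix[i, j] = matrix[j, i] = d
    # In the returned labels, -1 marks noise points.
    return DBSCAN(eps=eps, min_samples=min_samples,
                  metric="precomputed").fit_predict(matrix)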

What you lose is the ability to accelerate some algorithms with index structures. Computing the full distance matrix is O(n^2), so your algorithm cannot be faster than that. To add a new distance function to sklearn itself, you would need to modify its Cython code (passing a Python callable as the metric works, but yields very bad performance, unfortunately). Java tools such as ELKI can be extended with little overhead because Java's just-in-time compiler optimizes this well. If your distance function is a metric, then many index structures can be used to accelerate e.g. DBSCAN.
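
For completeness, the slow callable route can still operate on arbitrary Python objects if you cluster on row indices and look the real objects up inside the callable. A sketch (cluster_by_index is an illustrative name, and this stays O(n^2)):

import numpy as np
from sklearn.cluster import DBSCAN

def cluster_by_index(instances, eps=0.5, min_samples=5):
    # Feed DBSCAN row indices; the callable maps them back to the objects.
    indices = np.arange(len(instances), dtype=float).reshape(-1, 1)
    def index_dist(u, v):
        return dist(instances[int(u[0])], instances[int(v[0])])
    # algorithm="brute" avoids tree indexes, which assume a true metric.
    return DBSCAN(eps=eps, min_samples=min_samples, metric=index_dist,
                  algorithm="brute").fit_predict(indices)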

Upvotes: 5
