TheFamousRat

Reputation: 105

Clustering arbitrary objects with custom distance function in Python

I have a list of Python objects that I want to cluster into an unknown number of groups. The objects cannot be compared with any of the distance metrics built into scikit-learn; they require a custom-defined distance function. I'm using DBSCAN from scikit-learn, which raises a TypeError when run on my data.

Here's what the faulty code looks like. The objects I want to cluster are "Patch" objects, obtained from scanning a 3D mesh:

import numpy as np
from sklearn.cluster import DBSCAN

def getPatchesSimilarity(patch1, patch2):
    ... # Logic to calculate the distance between two patches
    return dist

# Reading the data (a mesh object) and extracting its patches
mesh = readMeshFromFile("foo.obj")
patchesList = extractPatchesFromMesh(mesh)

clustering = DBSCAN(metric=getPatchesSimilarity).fit(np.array([[patch] for patch in patchesList]))

When run, this code produces the following error:

TypeError: float() argument must be a string or a number, not 'Patch'

This seems to mean that scikit-learn's DBSCAN implementation only accepts data made of numbers or strings, not arbitrary objects. Is that the case?

I have also tried passing only the indices of the patches, so that the data was numerical, but that didn't work either. The last remaining option would be a precomputed distance matrix, but the number of objects is so large that my computer could not store such a matrix.

Upvotes: 4

Views: 881

Answers (1)

Alexander L. Hayes

Reputation: 4273

Short answer: No to both parts.

  1. "Adding an API for user-defined distance functions in clustering" has been an open issue since 2012. (Edit: I missed one part: DBSCAN does support passing a metric callable, but the callable is still invoked on rows of a numeric vector representation.)
  2. Any array passed to .fit must first pass scikit-learn's check_array validation, which coerces the input to a numeric dtype.
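To illustrate point 2, here is a minimal sketch (with a stand-in `Patch` class, not the asker's) showing that `check_array` rejects an array of arbitrary objects because it tries to coerce every entry to a float:

```python
import numpy as np
from sklearn.utils import check_array

class Patch:
    """Stand-in for the asker's Patch objects."""
    pass

object_array = np.array([[Patch()], [Patch()]])

try:
    check_array(object_array)  # attempts float coercion and fails
    raised = False
except (TypeError, ValueError):
    raised = True

print(raised)
```

This is the same validation step that produces the `TypeError` in the question.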

One solution would be to implement a method that converts an object to a list/vector:

import numpy as np
data = np.array([
    [-0.538, -0.478, -0.374, -0.338, -0.346, 0.230, 0.246, 0.366, 0.362, 0.342],
    [0.471, 0.559, 0.411, 0.507, 0.631, 0.579, 0.467, 0.475, 0.543, 0.659],
]).T

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def to_list(self):
        return [self.x, self.y]

    def __repr__(self):
        return f"{self.__class__.__name__}({self.x},{self.y})"

points = [Point(*xy) for xy in data]
# [Point(-0.538,0.471), Point(-0.478,0.559), ... , Point(0.342,0.659)]

Then you can cluster the vector representation:

from sklearn.cluster import KMeans

points_vector = np.array([point.to_list() for point in points])
# [[-0.538  0.471]
#  [-0.478  0.559]
#  ...
#  [ 0.342  0.659]]

cluster = KMeans(n_clusters=2)
cluster.fit(points_vector)

Implementing a clustering algorithm for lists of arbitrary Python objects is probably possible (I found a cluster library that might be close). I'd be interested if someone has tried this.
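As a side note, the index trick the asker mentioned can be made to work with DBSCAN's metric callable: pass the indices as a numeric array (so `check_array` is satisfied) and look the real objects up inside the metric. A minimal sketch, using a hypothetical one-dimensional `Patch` class as a stand-in for the asker's mesh patches:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical stand-in for the asker's Patch objects: one scalar feature.
class Patch:
    def __init__(self, value):
        self.value = value

patches = [Patch(v) for v in [0.0, 0.1, 0.2, 5.0, 5.1]]

def patch_distance(i, j):
    # DBSCAN hands the metric two rows of X; each row holds one index.
    p1 = patches[int(i[0])]
    p2 = patches[int(j[0])]
    return abs(p1.value - p2.value)

# Cluster over indices: a plain numeric array, so check_array accepts it.
X = np.arange(len(patches), dtype=float).reshape(-1, 1)
labels = DBSCAN(eps=0.5, min_samples=2, metric=patch_distance).fit(X).labels_
```

This avoids materializing a full distance matrix, though a Python metric callable is slow for large datasets since it is invoked once per pair considered.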

Upvotes: 2
