Reputation: 105
I have a list of Python objects that I want to cluster into an unknown number of groups. The objects cannot be compared with any of the distance functions built into scikit-learn, only with a custom-defined one. I'm using DBSCAN from scikit-learn, which raises a TypeError when run on my data.
Here's what the faulty code looks like. The objects I want to cluster are "Patch" objects, obtained from scanning a 3D mesh:
import numpy as np
from sklearn.cluster import DBSCAN

def getPatchesSimilarity(patch1, patch2):
    ...  # Logic to calculate the distance between the two patches
    return dist

# Reading the data (a mesh object) and extracting its patches
mesh = readMeshFromFile("foo.obj")
patchesList = extractPatchesFromMesh(mesh)

clustering = DBSCAN(metric=getPatchesSimilarity).fit(
    np.array([[patch] for patch in patchesList]))
When run, this code produces the following error:
TypeError: float() argument must be a string or a number, not 'Patch'
This seems to mean that scikit-learn's DBSCAN doesn't work with values that aren't vectors or strings?
I have also tried using only the indices of the patches, so that the data passed was numerical, but that didn't work either. The last remaining option would be a precomputed distance matrix, but the number of objects is so large that my computer could not store such a matrix.
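For context, the index-based approach is usually attempted like the sketch below. All names here (patches, patch_distance, index_metric) are hypothetical stand-ins, not from the question; the point is that DBSCAN hands the metric each sample as a 1-element float array, so the indices must be cast back to int before lookup, which is a common source of failure with this trick:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical stand-in objects; the real code would use Patch objects
patches = [{"size": s} for s in [1.0, 1.1, 5.0, 5.2, 9.9]]

def patch_distance(p1, p2):
    # Assumed custom distance between two objects for this sketch
    return abs(p1["size"] - p2["size"])

def index_metric(i, j):
    # DBSCAN passes each sample as a 1-element float array,
    # so cast back to int before indexing the object list
    return patch_distance(patches[int(i[0])], patches[int(j[0])])

# One column of indices stands in for the objects themselves
indices = np.arange(len(patches)).reshape(-1, 1)
labels = DBSCAN(eps=0.5, min_samples=2,
                metric=index_metric).fit_predict(indices)
# labels: [0, 0, 1, 1, -1] — two clusters plus one noise point
```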
Upvotes: 4
Views: 881
Reputation: 4273
Short answer: No to both parts.
DBSCAN does support passing a metric callable, but this would still have to be done with respect to a vector representation, because .fit has to successfully pass check_array.
One solution would be to implement a method that converts an object to a list/vector:
import numpy as np
data = np.array([[-0.538,-0.478,-0.374,-0.338,-0.346,0.230,0.246,0.366,0.362,0.342],[0.471,0.559,0.411,0.507,0.631,0.579,0.467,0.475,0.543,0.659]]).T
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def to_list(self):
        return [self.x, self.y]

    def __repr__(self):
        return str(self.__class__.__name__) + "(" + str(self.x) + "," + str(self.y) + ")"
points = [Point(*xy) for xy in data]
# [Point(-0.538,0.471), Point(-0.478,0.559), ... , Point(0.342,0.659)]
Then you can cluster the vector representation:
from sklearn.cluster import KMeans
points_vector = np.array([point.to_list() for point in points])
# [[-0.538 0.471]
# [-0.478 0.559]
# ...
# [ 0.342 0.659]]
cluster = KMeans(n_clusters=2)
cluster.fit(points_vector)
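Once the objects have a vector representation, a metric callable can also be passed to DBSCAN directly, which is closer to what the question asked. A minimal sketch, with a hand-written Euclidean metric and eps chosen for this toy data (both assumptions, not from the answer above):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def my_metric(a, b):
    # Custom metric over the vector representation;
    # DBSCAN passes each sample as a 1-D float array
    return np.sqrt(np.sum((a - b) ** 2))

points_vector = np.array([[-0.538, 0.471], [-0.478, 0.559],
                          [0.342, 0.659], [0.366, 0.475]])
labels = DBSCAN(eps=0.2, min_samples=2,
                metric=my_metric).fit_predict(points_vector)
# labels: [0, 0, 1, 1] — the two left points and the two right points
```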
Implementing a clustering algorithm for lists of arbitrary Python objects is probably possible (I found a library called cluster that might be close). I'd be interested if someone has tried this.
Upvotes: 2