Reputation: 1187
I'm currently searching for an efficient algorithm that takes in a set of points from three dimensional spaces and groups them into classes (maybe represented by a list). A point should belong to a class if it is close to one or more other points from the class. Two classes are then the same if they share any point. Because I'm working with large data sets, I don't want to use recursive methods. Also, using something like a distance matrix with O(n^2) performance is what I try to avoid.
I tried to check for some algorithms online, but most of them don't appeal to this specific purpose (e.g. k-d tree or other cluster algorithms). I thought about parting space into smaller parts, but that (potentially) results in an inexact result.
I tried to write something myself, but it turned out to be flawed. I would sort my points after distance and append the distance as a fourth coordinate and then repeat the following the following code-segment:
def grouping_presorted(lst, distance):
positions = [0]
x = []
while positions:
curr_el = lst[ positions[-1] ]
nn_i = HasNeighbor(lst, distance, positions[-1])
if nn_i is None:
x.append(lst.pop(positions[-1]) )
positions.pop(-1)
else:
positions.append(nn_i)
return x
def HasNeighbor(lst,distance,index):
i = index+1
while lst[i][3]- lst[index][3] < distance:
dist = (lst[i][0]-lst[index][0])**2 + (lst[i][1]-lst[index][1])**2 + (lst[i][2]-lst[index][2])**2
if dist < distance:
return i
i+=1
return None
Aside from an (probably easy to fix) overflow error, there's a bigger flaw in the logic of linking the points. If you think of my points describing lines in space, the algorithm only works for lines that strictly point outwards the origin, but not for circles or similar structures.
Does anybody know of a prewritten code for this or have an idea what I could try?
Thanks in advance.
Edit: It seems my spelling and maybe confusion of some terms has sparked some misunderstandings. I hope that this (badly-made) sketch helps. In this example, I marked my reference distance as d and circled the two containers I wan't to end up with in red.
Upvotes: 3
Views: 3889
Reputation: 1187
After following all the suggestions of your comments, help from cs.stackexchange and doing some research I was able to write down two different methods for solving this problem. In case someone might be interested, I decided to share them here. Again, the problem is to write a program that takes in a set of coordinate tuples and groups them into clusters. Two points x,y belong to the same cluster if there is a sequence of elements x=x_1,..,y=x_N such that d(x_i,x_i+1)
DBSCAN: By fixing euclidean metric, minPts = 2 and grouping distance epsilon = r. scikit-learn provides a nice implementation of this algorithm. A minimal code snippet for the task would be:
from sklearn.cluster import DBSCAN
from sklearn.datasets.samples_generator import make_blobs
import networkx as nx
import scipy.spatial as sp
def cluster(data, epsilon,N): #DBSCAN, euclidean distance
db = DBSCAN(eps=epsilon, min_samples=N).fit(data)
labels = db.labels_ #labels of the found clusters
n_clusters = len(set(labels)) - (1 if -1 in labels else 0) #number of clusters
clusters = [data[labels == i] for i in range(n_clusters)] #list of clusters
return clusters, n_clusters
centers = [[1, 1,1], [-1, -1,1], [1, -1,1]]
X,_ = make_blobs(n_samples=N, centers=centers, cluster_std=0.4,
random_state=0)
cluster(X,epsilon,N)
On my machine, N=20000 for this clustering variation with an epsilon of epsilon = 0.1 takes just 290ms, so this seems really quick to me.
Graph components: One can think of this problem as follows: The coordinates define nodes of a graph, and two nodes are adjacent if their distance is smaller than epsilon/r. A cluster is then given as a connected component of this graph. At first I had problems implementing this graph, but there are many ways to write a linear time algorithm to do this. The easiest and fastest way however, for me, was to use scipy.spatial's cKDTree data structure and the corresponding query_pairs() method, that returns a list of indice tuples of points that are in given distance. One could for example write it like this:
class IGraph:
def __init__(self, nodelst=[], radius = 1):
self.igraph = nx.Graph()
self.radii = radius
self.nodelst = nodelst #nodelst is array of coordinate tuples, graph contains indices as nodes
self.__make_edges__()
def __make_edges__(self):
self.igraph.add_edges_from( sp.cKDTree(self.nodelst).query_pairs(r=self.radii) )
def get_conn_comp(self):
ind = [list(x) for x in nx.connected_components(self.igraph) if len(x)>1]
return [self.nodelst[indlist] for indlist in ind]
def graph_cluster(data, epsilon):
graph = IGraph(nodelst = data, radius = epsilon)
clusters = graph.get_conn_comp()
return clusters, len(clusters)
For the same dataset mentioned above, this method takes 420ms to find the connected components. However, for smaller clusters, e.g. N=700, this snippet runs faster. It also seems to have an advantage for finding smaller clusters (that is being given smaller epsilon values) and a vast disadvantage in the other direction (all on this specific dataset of course). I think, depending on the given situation, both methods are worth considering.
Hope this is of use for somebody.
Edit: Theoretically, DBSCAN has computational complexity O(n log n) when properly implemented (according to wikipedia...), while constructing the graph as well as finding its connected components runs linear in time. I'm not sure how well these statements hold for the given implementations though.
Upvotes: 3
Reputation: 3755
You could try https://en.wikipedia.org/wiki/OPTICS_algorithm. When you index the points first (e.g, with an R-Tree) this should be possible in O(n log n).
Edit:
If you already know your epsilon and how many points are minimally in a cluster (minpoints) then DBSCAN could be the better choice.
Upvotes: 2
Reputation: 3880
Adapt a path-finding algorithm, such as Dijkstra's or A*, or alternatively adapt the breadth-first or depth-first search of a graph. Start at any point in the set of unvisited points, and proceed with whichever algorithm you've picked with the caveat that a point is considered to be connected only to all points to which its distance is less than the threshhold. When you've finished off with one class (i.e. when you can discover no more new nodes), pick any node from the set of unvisited nodes and repeat.
Upvotes: 0