Optimizing clustering in Python

Question

I wrote my own clustering algorithm (bad, I know) for my problem. It works well, but could work faster.

Algorithm takes list of values (1D) as in input, and works like this:

For each cluster, calculate distance to closest neighbor cluster
Select the cluster A which has smallest distance to neighbor B
If distance between A and B is less then threshold, return
Combine A and B
Goto 1.

I probably reinvented a wheel here..

This is my brute foce code, how to make it faster? I've Scipy and Numpy installed, if there's something ready made

#cluster center as simple average value
def cluster_center(cluster):
  return sum(cluster) / len(cluster)

#Distance between clusters
def cluster_distance(a, b):
  return abs(cluster_center(a) - cluster_center(b))

while True:
  cluster_distances = []

  #If nothing to cluster, ready
  if len(clusters) < 2:
    break

  #Go thru all clusters, calculate shortest distance to neighbor  
  for cluster in clusters:
    cluster_distances.append((cluster, sorted([(cluster_distance(cluster, c), c) for c in clusters if c != cluster])[0]))

  #Find out closest pair 
  cluster_distances.sort(cmp=lambda a,b:cmp(a[1], b[1]))

  #Check if distance is under threshold 15
  if cluster_distances[0][1][0] < 15:
     a = cluster_distances[0][0]
     b = cluster_distances[0][1][1]
     #Combine clusters (combine lists)
     a.extend(b)

     #Form a new cluster list
     clusters = [c[0] for c in cluster_distances if c[0] != b]
  else:
    break

Has QUIT--Anony-Mousse · Accepted Answer

Usually, the term "cluster analysis" is only used for multi-variate partitions. Because in 1d, you can actually sort your data, and solve much of these problems much easier this way.

So to speed up your approach, sort your data! And reconsider what you then need to do.

As for a more advanced method: do kernel density estimation, and look for local minima as splitting points.

Optimizing clustering in Python

Answers (1)

Related Questions