Reputation: 6137
I wrote my own clustering algorithm (bad, I know) for my problem. It works well, but it could be faster.

The algorithm takes a list of values (1D) as input and works like this: it repeatedly finds the pair of clusters whose centers (simple averages) are closest and merges them, until the smallest distance exceeds a threshold (15).

I probably reinvented the wheel here..

This is my brute force code; how can I make it faster? I have SciPy and NumPy installed, in case there's something ready-made.
# Cluster center as a simple average value
def cluster_center(cluster):
    return sum(cluster) / len(cluster)

# Distance between clusters
def cluster_distance(a, b):
    return abs(cluster_center(a) - cluster_center(b))

# clusters starts as one single-element list per input value
while True:
    # If there's nothing left to merge, we're done
    if len(clusters) < 2:
        break
    cluster_distances = []
    # Go through all clusters, find each one's nearest neighbour
    for cluster in clusters:
        nearest = min(((cluster_distance(cluster, c), c)
                       for c in clusters if c is not cluster),
                      key=lambda t: t[0])
        cluster_distances.append((cluster, nearest))
    # Find the closest pair overall
    cluster_distances.sort(key=lambda item: item[1][0])
    # Check if the distance is under the threshold 15
    if cluster_distances[0][1][0] < 15:
        a = cluster_distances[0][0]
        b = cluster_distances[0][1][1]
        # Combine clusters (combine the lists)
        a.extend(b)
        # Form the new cluster list
        clusters = [c[0] for c in cluster_distances if c[0] is not b]
    else:
        break
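For reference, the merge-closest-pair loop above is essentially centroid-linkage agglomerative clustering, which SciPy already provides. A sketch (the function name and the grouping helper are my own; the cut behaviour is similar to, but not exactly identical to, the greedy loop, since SciPy cuts on cophenetic distance):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def scipy_cluster(values, threshold=15):
    """Centroid-linkage agglomerative clustering of 1D values,
    cut at the given distance threshold."""
    data = np.asarray(values, dtype=float).reshape(-1, 1)
    Z = linkage(data, method="centroid")
    labels = fcluster(Z, t=threshold, criterion="distance")
    # Group the original values by cluster label
    clusters = {}
    for v, lab in zip(values, labels):
        clusters.setdefault(lab, []).append(v)
    return list(clusters.values())
```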
Upvotes: 1
Views: 948
Reputation: 77454
Usually, the term "cluster analysis" is only used for multivariate partitions, because in 1D you can actually sort your data and solve many of these problems much more easily that way.
So to speed up your approach, sort your data! Then reconsider what you actually need to do.
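On sorted 1D data, a single left-to-right pass is enough: start a new cluster whenever the gap to the previous value exceeds the threshold. A minimal sketch (this is single-linkage-style splitting on gaps, not an exact replica of the centroid-merging loop in the question):

```python
def cluster_sorted(values, gap=15):
    """Split 1D data into clusters wherever the gap between
    consecutive sorted values exceeds the threshold."""
    if not values:
        return []
    values = sorted(values)
    clusters = [[values[0]]]
    for v in values[1:]:
        if v - clusters[-1][-1] > gap:
            clusters.append([v])   # gap too large: start a new cluster
        else:
            clusters[-1].append(v)
    return clusters
```

This runs in O(n log n) for the sort plus O(n) for the pass, versus the O(n^3)-ish repeated pairwise scans of the brute-force loop.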
As for a more advanced method: do kernel density estimation, and look for local minima as splitting points.
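A sketch of that idea using SciPy's Gaussian KDE, evaluating the density on a grid and taking local minima as candidate split points (the function name and grid size are my own choices; bandwidth is left at SciPy's default):

```python
import numpy as np
from scipy.signal import argrelmin
from scipy.stats import gaussian_kde

def kde_split_points(values, grid_size=512):
    """Estimate the density of 1D data over a grid and return the
    x positions of local minima, usable as cluster boundaries."""
    values = np.asarray(values, dtype=float)
    kde = gaussian_kde(values)
    xs = np.linspace(values.min(), values.max(), grid_size)
    density = kde(xs)
    minima = argrelmin(density)[0]   # indices of local minima
    return xs[minima]
```

Values on either side of each returned split point then form the clusters; unlike a fixed distance threshold, this adapts to the data's density.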
Upvotes: 2