Reputation: 13965
I am trying to cluster a group of strings based upon their similarity as given by the jaro-distance. I am computing that with JellyFish in python. I have been struggling to figure out how to cluster the data. I am not an expert whatsoever in clustering, and this is my first attempt to figure out how to do this.
Currently I have written something that I believe approximates the Single Pass Method method of partitioning that works to my understanding like this:
I would like any hints about my code, but more importantly, if anyone knows of a better method for doing this. I read about k-means, but I have no idea how to specify k (let alone how to actually implement that). If you do decide to give advice, couching it in layman's terms with some idea of where to look for guidance would be much appreicated. Thanks.
A couple of notes about the code. 1. I shuffle the list as the start point is arbitrary 2. I update the centroid if one matched score scores higher than a previous match - this is arbitrary with respect to the first string that part of the cluster but is an attempt to get to the 'truest' string as the cluster centroid.
Thanks in advance for any guidance
def SLINK(SList):
shuffle(SList)
Clusters = []
Centroid = []
Scores = []
for string in SList:
Matched = 0
if len(Clusters) == 0:
Clusters.append([string])
Centroid.append([string])
Scores.append([])
continue
for ClustNum in xrange(len(Clusters)):
Dist = jf.jaro_distance(string, Centroid[ClustNum][0])
if Dist > 0.8:
Clusters[ClustNum].append(string)
if len(Scores[ClustNum]) == 0:
Scores[ClustNum].append(Dist)
else:
if Dist > Scores[ClustNum]:
Scores[ClustNum][0] = Dist
Centroid[ClustNum][0] = string
Matched = 1
break
if Matched ==0:
Clusters.append([string])
Centroid.append([string])
Scores.append([])
return Clusters
Upvotes: 3
Views: 2115
Reputation: 21924
If your question is just generally about clustering I would suggest looking for a more intuitive and easily implemented version than what you've got there. Specifically the FLAME clustering algorithm has a fantastic explanation of how to implement the algorithm on wikipedia.
Upvotes: 2