String Clustering Algorithm - Guidance

Question

I am trying to cluster a group of strings based upon their similarity as given by the jaro-distance. I am computing that with JellyFish in python. I have been struggling to figure out how to cluster the data. I am not an expert whatsoever in clustering, and this is my first attempt to figure out how to do this.

Currently I have written something that I believe approximates the Single Pass Method method of partitioning that works to my understanding like this:

Make the first string the cnetroid for the first cluster
For the next string calculate the similarity with that centroid using Jaro Distance.
If they are sufficiently similar then add the string to the cluster otherwise use the string to start a new cluster with it as the centroid
Repeat until no strings left to evaluate

I would like any hints about my code, but more importantly, if anyone knows of a better method for doing this. I read about k-means, but I have no idea how to specify k (let alone how to actually implement that). If you do decide to give advice, couching it in layman's terms with some idea of where to look for guidance would be much appreicated. Thanks.

A couple of notes about the code. 1. I shuffle the list as the start point is arbitrary 2. I update the centroid if one matched score scores higher than a previous match - this is arbitrary with respect to the first string that part of the cluster but is an attempt to get to the 'truest' string as the cluster centroid.

Thanks in advance for any guidance

def SLINK(SList):
    shuffle(SList)
    Clusters = []
    Centroid = []
    Scores = []
    for string in SList:
    Matched = 0

    if len(Clusters) == 0:
        Clusters.append([string])
        Centroid.append([string])
        Scores.append([])
        continue

    for ClustNum in xrange(len(Clusters)):
        Dist = jf.jaro_distance(string, Centroid[ClustNum][0])

        if Dist > 0.8:
            Clusters[ClustNum].append(string)

            if len(Scores[ClustNum]) == 0:
                Scores[ClustNum].append(Dist)
            else:
                if Dist > Scores[ClustNum]:
                    Scores[ClustNum][0] = Dist
                    Centroid[ClustNum][0] = string

            Matched = 1
            break

    if Matched ==0:       
        Clusters.append([string])
        Centroid.append([string])
        Scores.append([])

return Clusters

String Clustering Algorithm - Guidance

Answers (1)

Related Questions