improve my code to group same words in a large list python and comparison to other code

Question

I've been reading some of the other links ( What is a good strategy to group similar words? and Fuzzy Group By, Grouping Similar Words) that are related to group similar words. I'm curious (1) if someone can give me guidance as to how one of the algorithms I found in the second link works and (2) how the style of the programming compares to my own 'naive' approach?

If you can even just answer either 1 or 2, I'll give you an upvote.

(1) Can someone help step me through what's going on here?

class Seeder:
    def __init__(self):
        self.seeds = set()
        self.cache = dict()
    def get_seed(self, word):
        LIMIT = 2
        seed = self.cache.get(word,None)
        if seed is not None:
            return seed
        for seed in self.seeds:
            if self.distance(seed, word) <= LIMIT:
                self.cache[word] = seed
                return seed
        self.seeds.add(word)
        self.cache[word] = word
        return word

    def distance(self, s1, s2):
        l1 = len(s1)
        l2 = len(s2)
        matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)]
        for zz in xrange(0,l2):
            for sz in xrange(0,l1):
                if s1[sz] == s2[zz]:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
                else:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
        return matrix[l2][l1]

import itertools

def group_similar(words):
    seeder = Seeder()
    words = sorted(words, key=seeder.get_seed)
    groups = itertools.groupby(words, key=seeder.get_seed)

(2) In my approach I have a list of strings I want to group called residencyList and used default dictionaries.

Array(['Psychiatry', 'Radiology Medicine-Prelim',
       'Radiology Medicine-Prelim', 'Medicine', 'Medicine',
       'Obstetrics/Gynecology', 'Obstetrics/Gyncology',
       'Orthopaedic Surgery', 'Surgery', 'Pediatrics',
       'Medicine/Pediatrics',])

My effort to group. I base it off uniqueResList, which is np.unique(residencyList)

d = collections.defaultdict(int)
for i in residencyList:
    for x in uniqueResList:
        if x ==  i:
            if not d[x]:
                #print i, x
                d[x] = i  
                #print d
            if d[x]:
                d[x] = d.get(x, ()) + ', ' + i
        else:
            #print 'no match'
            continue

RedBaron · Accepted Answer

I'll try to answer the first part. The class Seeder attempts to find seeds of the words. Two similar words are assumed to have the same seed and the similarity is controlled by the parameter LIMIT (In this case it is 2) which measures a distance between two words. There are many ways to compute String distance and your class does so in the distance function using some sort of ninja maths that is frankly above me.

def __init__(self):
    self.seeds = set()
    self.cache = dict()

Initialize the seeds as a set that keeps track of unique seeds till now and a cache that speeds up lookups in case we already have already seen the word (To save computation time).

For any word the get_seed function returns its seed.

def get_seed(self, word):
    #Set the acceptable distance
    LIMIT = 2
    #Have we seen this word before? 
    seed = self.cache.get(word,None)
    if seed is not None:
        #YES. Return from the cache
        return seed
    for seed in self.seeds:
        #NO. For each pre-existing seed, find the distance of this word from that seed
        if self.distance(seed, word) <= LIMIT:
            #This word is similar to the seed
            self.cache[word] = seed
            #We found this word's seed, cache it and return
            return seed
    #No we couldn't find a matching word in seeds. This is a new seed
    self.seeds.add(word)
    #Cache this word for future
    self.cache[word] = word
    #And return the seed (=word)
    return word

Then you sort the list of words in question by their seed. This ensures that words that have the same seed occur next to each other. This is important for the group by you use to form groups of words based on seed.

The distance function looks complicated and could possibly be replaced by something like Levenshtein.

improve my code to group same words in a large list python and comparison to other code

Answers (2)

Related Questions