user3314418
user3314418

Reputation: 3031

improve my code to group same words in a large list python and comparison to other code

I've been reading some of the other links ( What is a good strategy to group similar words? and Fuzzy Group By, Grouping Similar Words) that are related to group similar words. I'm curious (1) if someone can give me guidance as to how one of the algorithms I found in the second link works and (2) how the style of the programming compares to my own 'naive' approach?

If you can even just answer either 1 or 2, I'll give you an upvote.

(1) Can someone help step me through what's going on here?

class Seeder:
    def __init__(self):
        self.seeds = set()
        self.cache = dict()
    def get_seed(self, word):
        LIMIT = 2
        seed = self.cache.get(word,None)
        if seed is not None:
            return seed
        for seed in self.seeds:
            if self.distance(seed, word) <= LIMIT:
                self.cache[word] = seed
                return seed
        self.seeds.add(word)
        self.cache[word] = word
        return word

    def distance(self, s1, s2):
        l1 = len(s1)
        l2 = len(s2)
        matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)]
        for zz in xrange(0,l2):
            for sz in xrange(0,l1):
                if s1[sz] == s2[zz]:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
                else:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
        return matrix[l2][l1]

import itertools

def group_similar(words):
    seeder = Seeder()
    words = sorted(words, key=seeder.get_seed)
    groups = itertools.groupby(words, key=seeder.get_seed)

(2) In my approach I have a list of strings I want to group called residencyList and used default dictionaries.

Array(['Psychiatry', 'Radiology Medicine-Prelim',
       'Radiology Medicine-Prelim', 'Medicine', 'Medicine',
       'Obstetrics/Gynecology', 'Obstetrics/Gyncology',
       'Orthopaedic Surgery', 'Surgery', 'Pediatrics',
       'Medicine/Pediatrics',])

My effort to group. I base it off uniqueResList, which is np.unique(residencyList)

d = collections.defaultdict(int)
for i in residencyList:
    for x in uniqueResList:
        if x ==  i:
            if not d[x]:
                #print i, x
                d[x] = i  
                #print d
            if d[x]:
                d[x] = d.get(x, ()) + ', ' + i
        else:
            #print 'no match'
            continue

Upvotes: 2

Views: 361

Answers (2)

user3330510
user3330510

Reputation: 56

A short explanation of the "ninja maths" in distance:

 # this is just the edit distance (Levenshtein) between the two words
    def distance(self, s1, s2):
        l1 = len(s1) # length of first word
        l2 = len(s2) # length of second word
        matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)] 
           # make an l2 + 1 by l1 + 1 matrix where the first row and column count up from
           # 0 to l1 and l2 respectively (these will be the costs of
           # deleting the letters that came before that element in each word)
        for zz in xrange(0,l2):
            for sz in xrange(0,l1):
                if s1[sz] == s2[zz]: # if the two letters are the same then we
                       # don't have to change them so take the 
                       # cheapest path from the options of
                       # matrix[zz+1][sz] + 1 (delete the letter in s1)
                       # matrix[zz][sz+1] + 1 (delete the letter in s2)
                       # matrix[zz][sz] (leave both letters)
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
                else: # if the two letters are not the same then we
                         # have to change them so take the 
                         # cheapest path from the options of
                         # matrix[zz+1][sz] + 1 (delete the letter in s1)
                         # matrix[zz][sz+1] + 1 (delete the letter in s2)
                         # matrix[zz][sz] + 1 (swap a letter)
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
        return matrix[l2][l1] # the value at the bottom of the matrix is equal to the cheapest set of edits

Upvotes: 2

RedBaron
RedBaron

Reputation: 4745

I'll try to answer the first part. The class Seeder attempts to find seeds of the words. Two similar words are assumed to have the same seed and the similarity is controlled by the parameter LIMIT (In this case it is 2) which measures a distance between two words. There are many ways to compute String distance and your class does so in the distance function using some sort of ninja maths that is frankly above me.

def __init__(self):
    self.seeds = set()
    self.cache = dict()

Initialize the seeds as a set that keeps track of unique seeds till now and a cache that speeds up lookups in case we already have already seen the word (To save computation time).

For any word the get_seed function returns its seed.

def get_seed(self, word):
    #Set the acceptable distance
    LIMIT = 2
    #Have we seen this word before? 
    seed = self.cache.get(word,None)
    if seed is not None:
        #YES. Return from the cache
        return seed
    for seed in self.seeds:
        #NO. For each pre-existing seed, find the distance of this word from that seed
        if self.distance(seed, word) <= LIMIT:
            #This word is similar to the seed
            self.cache[word] = seed
            #We found this word's seed, cache it and return
            return seed
    #No we couldn't find a matching word in seeds. This is a new seed
    self.seeds.add(word)
    #Cache this word for future
    self.cache[word] = word
    #And return the seed (=word)
    return word

Then you sort the list of words in question by their seed. This ensures that words that have the same seed occur next to each other. This is important for the group by you use to form groups of words based on seed.

The distance function looks complicated and could possibly be replaced by something like Levenshtein.

Upvotes: 1

Related Questions