Reputation: 3031
I've been reading some of the other links ( What is a good strategy to group similar words? and Fuzzy Group By, Grouping Similar Words) that are related to group similar words. I'm curious (1) if someone can give me guidance as to how one of the algorithms I found in the second link works and (2) how the style of the programming compares to my own 'naive' approach?
If you can even just answer either 1 or 2, I'll give you an upvote.
(1) Can someone help step me through what's going on here?
class Seeder:
def __init__(self):
self.seeds = set()
self.cache = dict()
def get_seed(self, word):
LIMIT = 2
seed = self.cache.get(word,None)
if seed is not None:
return seed
for seed in self.seeds:
if self.distance(seed, word) <= LIMIT:
self.cache[word] = seed
return seed
self.seeds.add(word)
self.cache[word] = word
return word
def distance(self, s1, s2):
l1 = len(s1)
l2 = len(s2)
matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)]
for zz in xrange(0,l2):
for sz in xrange(0,l1):
if s1[sz] == s2[zz]:
matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
else:
matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
return matrix[l2][l1]
import itertools
def group_similar(words):
seeder = Seeder()
words = sorted(words, key=seeder.get_seed)
groups = itertools.groupby(words, key=seeder.get_seed)
(2) In my approach I have a list of strings I want to group called residencyList and used default dictionaries.
Array(['Psychiatry', 'Radiology Medicine-Prelim',
'Radiology Medicine-Prelim', 'Medicine', 'Medicine',
'Obstetrics/Gynecology', 'Obstetrics/Gyncology',
'Orthopaedic Surgery', 'Surgery', 'Pediatrics',
'Medicine/Pediatrics',])
My effort to group. I base it off uniqueResList, which is np.unique(residencyList)
d = collections.defaultdict(int)
for i in residencyList:
for x in uniqueResList:
if x == i:
if not d[x]:
#print i, x
d[x] = i
#print d
if d[x]:
d[x] = d.get(x, ()) + ', ' + i
else:
#print 'no match'
continue
Upvotes: 2
Views: 361
Reputation: 56
A short explanation of the "ninja maths" in distance:
# this is just the edit distance (Levenshtein) between the two words
def distance(self, s1, s2):
l1 = len(s1) # length of first word
l2 = len(s2) # length of second word
matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)]
# make an l2 + 1 by l1 + 1 matrix where the first row and column count up from
# 0 to l1 and l2 respectively (these will be the costs of
# deleting the letters that came before that element in each word)
for zz in xrange(0,l2):
for sz in xrange(0,l1):
if s1[sz] == s2[zz]: # if the two letters are the same then we
# don't have to change them so take the
# cheapest path from the options of
# matrix[zz+1][sz] + 1 (delete the letter in s1)
# matrix[zz][sz+1] + 1 (delete the letter in s2)
# matrix[zz][sz] (leave both letters)
matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
else: # if the two letters are not the same then we
# have to change them so take the
# cheapest path from the options of
# matrix[zz+1][sz] + 1 (delete the letter in s1)
# matrix[zz][sz+1] + 1 (delete the letter in s2)
# matrix[zz][sz] + 1 (swap a letter)
matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
return matrix[l2][l1] # the value at the bottom of the matrix is equal to the cheapest set of edits
Upvotes: 2
Reputation: 4745
I'll try to answer the first part. The class Seeder
attempts to find seeds
of the words. Two similar words are assumed to have the same seed and the similarity is controlled by the parameter LIMIT
(In this case it is 2) which measures a distance between two words. There are many ways to compute String distance and your class does so in the distance
function using some sort of ninja maths that is frankly above me.
def __init__(self):
self.seeds = set()
self.cache = dict()
Initialize the seeds as a set
that keeps track of unique seeds till now and a cache
that speeds up lookups in case we already have already seen the word (To save computation time).
For any word the get_seed
function returns its seed.
def get_seed(self, word):
#Set the acceptable distance
LIMIT = 2
#Have we seen this word before?
seed = self.cache.get(word,None)
if seed is not None:
#YES. Return from the cache
return seed
for seed in self.seeds:
#NO. For each pre-existing seed, find the distance of this word from that seed
if self.distance(seed, word) <= LIMIT:
#This word is similar to the seed
self.cache[word] = seed
#We found this word's seed, cache it and return
return seed
#No we couldn't find a matching word in seeds. This is a new seed
self.seeds.add(word)
#Cache this word for future
self.cache[word] = word
#And return the seed (=word)
return word
Then you sort the list of words in question by their seed. This ensures that words that have the same seed occur next to each other. This is important for the group by you use to form groups of words based on seed.
The distance
function looks complicated and could possibly be replaced by something like Levenshtein.
Upvotes: 1