Reputation: 709
I'm trying to compare every combination of phrases within each group and score how well they match. I'm getting hung up on looping through the groups:
import pandas as pd
from fuzzywuzzy import fuzz as fz
import itertools
data = [[1,'ab'],[1,'bc'],[1,'de'],[2,'gh'],[2,'hi'],[2,'jk'],[3,'kl'],[3,'lm'],[3,'yz']]
df = pd.DataFrame(data,columns=['Ids','DESCR'])
def iterated(df):
    for a, b in itertools.product(df['DESCR'],df['DESCR']):
        try:
            print(a, b, fz.partial_ratio(a, b), fz.token_set_ratio(a,b))
        except:
            pass
    return result
df.groupby('Ids').apply(iterated(df))
The code above compares each DESCR against every value in the whole DataFrame, rather than restricting the comparison to each group. I'm getting:
ab ab 100 100
ab bc 50 50
ab de 0 0
ab gh 0 0
ab hi 0 0
ab jk 0 0
ab kl 0 0
ab lm 0 0
ab yz 0 0
bc ab 67 50
bc bc 100 100
bc de 0 0
bc gh 0 0
bc hi 0 0
bc jk 0 0
bc kl 0 0
bc lm 0 0
bc yz 0 0
...
But it should be:
ab bc 50 50
ab de 0 0
bc de 0 0
gh hi 50 50
gh jk 0 0
hi jk 50 50
...
Thank you.
Upvotes: 0
Views: 207
Reputation: 11883
I think the problem is that you aren't handling the groups correctly. Writing .apply(iterated(df)) calls iterated on the entire df before apply is involved, so the comparisons run over every DESCR value instead of one group at a time. Also, I think you want to use combinations instead of product, since product pairs each phrase with itself and yields every pair in both orders.
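To make the difference between product and combinations concrete, here is a minimal sketch using only the standard library. The list name phrases is a hypothetical stand-in for the DESCR values of a single group.

```python
import itertools

# Hypothetical stand-in for the DESCR values of one group
phrases = ['ab', 'bc', 'de']

# product yields every ordered pair, including self-pairs such as ('ab', 'ab')
all_pairs = list(itertools.product(phrases, phrases))

# combinations yields each unordered pair of distinct positions exactly once
unique_pairs = list(itertools.combinations(phrases, 2))

print(len(all_pairs))
print(unique_pairs)
```

For three phrases, product gives nine pairs while combinations gives the three pairs listed as the desired output for group 1.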
You may need to break it apart and handle the groups individually. Consider:
import pandas as pd
import itertools
data = [[1,'ab'],[1,'bc'],[1,'de'],[2,'gh'],[2,'hi'],[2,'jk'],[3,'kl'],[3,'lm'],[3,'yz']]
df = pd.DataFrame(data,columns=['Ids','DESCR'])
def show_combos(df):  # replace with your function...
    combos = itertools.combinations(df.DESCR, 2)
    for c in combos:
        print(c)
groups = df.groupby('Ids')
# iterate through the groups, which are mini-data frames
for name, group in groups:
    print('group name: {}'.format(name))
    show_combos(group)
    print()
Which yields the groups you wanted:
group name: 1
('ab', 'bc')
('ab', 'de')
('bc', 'de')
group name: 2
('gh', 'hi')
('gh', 'jk')
('hi', 'jk')
group name: 3
('kl', 'lm')
('kl', 'yz')
('lm', 'yz')
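If you want the scores collected rather than printed, the same loop can build a result frame. This is only a sketch under assumptions: the similarity function below is a hypothetical stand-in based on the standard library's difflib, used so the example runs without fuzzywuzzy, and its numbers will not necessarily match fz.partial_ratio or fz.token_set_ratio. The column names a, b and score are also made up for illustration.

```python
import itertools
from difflib import SequenceMatcher

import pandas as pd

data = [[1, 'ab'], [1, 'bc'], [1, 'de'], [2, 'gh'], [2, 'hi'], [2, 'jk'],
        [3, 'kl'], [3, 'lm'], [3, 'yz']]
df = pd.DataFrame(data, columns=['Ids', 'DESCR'])

def similarity(a, b):
    # Assumption: stand-in scorer (difflib ratio scaled to 0-100).
    # Swap in fz.partial_ratio or fz.token_set_ratio from the question.
    return round(100 * SequenceMatcher(None, a, b).ratio())

rows = []
# Each iteration sees only the rows belonging to one Ids value
for name, group in df.groupby('Ids'):
    # combinations gives each unordered pair within the group once
    for a, b in itertools.combinations(group['DESCR'], 2):
        rows.append({'Ids': name, 'a': a, 'b': b, 'score': similarity(a, b)})

scores = pd.DataFrame(rows)
print(scores)
```

Keeping the scorer as a separate function means the grouping logic stays the same whichever matching function you plug in.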
Upvotes: 1