J Sedai

Reputation: 115

Finding similar groups based on an intersection of values in a different column

I have a df that looks like this:

Group   Attribute

Cheese  Dairy
Cheese  Food
Cheese  Curd
Cow     Dairy
Cow     Food
Cow     Animal
Cow     Hair
Cow     Stomachs
Yogurt  Dairy
Yogurt  Food
Yogurt  Curd
Yogurt  Fruity

What I'd like to do for each group is find the group that it's most like, based on the intersection of Attributes. The end form I'd like is:

Group   TotalCount   LikeGroup   CommonWords  PCT

Cheese  3            Yogurt      3            100.0
Cow     5            Cheese      2            40.0
Yogurt  4            Cheese      3            75.0

I realize this might be asking a lot in one question. I can do much of it but I'm really lost on getting a count for the intersection of Attributes, even between just one Group and another. If I could find the intersection count between Cheese and Yogurt that would send me in the right direction.

Is it possible to do it within the dataframe? I can see making several lists and doing intersections between all pairs of lists, then using the new list lengths to get the percentages.

For instance, for Yogurt:

>>> Yogurt = ['Dairy', 'Food', 'Curd', 'Fruity']
>>> Cheese = ['Dairy', 'Food', 'Curd']

>>> Yogurt_Cheese = len(set(Yogurt) & set(Cheese)) / len(Yogurt)
>>> Yogurt_Cheese
0.75

>>> Cow = ['Dairy', 'Food', 'Animal', 'Hair', 'Stomachs']

>>> Yogurt_Cow = len(set(Yogurt) & set(Cow)) / len(Yogurt)
>>> Yogurt_Cow
0.5

>>> max(Yogurt_Cheese, Yogurt_Cow)
0.75
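
Put differently, what I'm after is a way to get those lists from the df itself and then generalize the comparison to every pair of groups. A sketch of the first half, assuming the Group/Attribute columns above and Python 3's true division:

>>> attrs = df.groupby('Group')['Attribute'].apply(set)
>>> len(attrs['Yogurt'] & attrs['Cheese']) / len(attrs['Yogurt'])
0.75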

Upvotes: 2

Views: 1090

Answers (2)

Некто

Reputation: 1820

I created my own smaller version of your sample data.

import pandas as pd
from itertools import permutations

df = pd.DataFrame(data=[['cheese', 'dairy'], ['cheese', 'food'], ['cheese', 'curd'],
                        ['cow', 'dairy'], ['cow', 'food'],
                        ['yogurt', 'dairy'], ['yogurt', 'food'], ['yogurt', 'curd'], ['yogurt', 'fruity']],
                  columns=['Group', 'Attribute'])
count_dct = df.groupby('Group')['Attribute'].count().to_dict()  # maps each Group to its TotalCount, used later

unique_grp = df['Group'].unique()      # get the unique groups
unique_atr = df['Attribute'].unique()  # get the unique attributes

combos = list(permutations(unique_grp, 2))  # get all ordered pairs of groups
comp_df = pd.DataFrame(data=combos, columns=['Group', 'LikeGroup'])  # create the frame to put comparison data into
comp_df['CommonWords'] = 0

for atr in unique_atr:
    # restrict the dataframe to rows containing the attribute being looked at during this iteration
    temp_df = df[df['Attribute'] == atr]

    myl = list(permutations(temp_df['Group'], 2))  # the pairs of groups that share this attribute, as tuples
    for comb in myl:
        # increment CommonWords where the Group column equals the first entry
        # of the tuple and the LikeGroup column equals the second
        comp_df.loc[(comp_df['Group'] == comb[0]) & (comp_df['LikeGroup'] == comb[1]), 'CommonWords'] += 1

for key, val in count_dct.items():  # put the previously computed TotalCount into the comparison dataframe
    comp_df.loc[comp_df['Group'] == key, 'TotalCount'] = val

comp_df['PCT'] = (comp_df['CommonWords'] * 100.0 / comp_df['TotalCount']).round()

For my sample data I got this output:

    Group LikeGroup  CommonWords  TotalCount  PCT
0  cheese       cow            2           3   67
1  cheese    yogurt            3           3  100
2     cow    cheese            2           2  100
3     cow    yogurt            2           2  100
4  yogurt    cheese            3           4   75
5  yogurt       cow            2           4   50

which seems to be correct.
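
To reduce this to the one-row-per-Group form the question asks for, a possible final step (my addition, not part of the output above) is to keep the highest-PCT row for each Group:

best = comp_df.loc[comp_df.groupby('Group')['PCT'].idxmax()]  # idxmax keeps the first row on ties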

Upvotes: 4

TMWP

Reputation: 1625

It seems like you should be able to craft an aggregation strategy to crack this. Try looking at these code samples and think about how to construct keys and aggregation functions over your DataFrame, instead of tackling it piecemeal as in your example.

Try running this in your Python environment (it was created in a Jupyter notebook using Python 2.7) and see if it gives you some ideas about your code:

import numpy as np
import pandas as pd

np.random.seed(10)    # optional: makes sure you get the same random
                      # numbers used in the original experiment
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

df
group = df.groupby('key1')
group2 = df.groupby(['key1', 'key2'])
group2.agg(['count', 'sum', 'min', 'max', 'mean', 'std'])
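
Applied to the question's data (a df with the Group/Attribute columns as posted), one way the aggregation idea could play out (a sketch of my own, not the only strategy) is to aggregate each group's attributes into a set and then compare the sets pairwise:

from itertools import permutations

attrs = df.groupby('Group')['Attribute'].apply(set)   # one set of attributes per group

rows = []
for a, b in permutations(attrs.index, 2):             # every ordered pair of groups
    common = len(attrs[a] & attrs[b])                 # size of the attribute intersection
    rows.append((a, len(attrs[a]), b, common, 100.0 * common / len(attrs[a])))

result = pd.DataFrame(rows, columns=['Group', 'TotalCount', 'LikeGroup', 'CommonWords', 'PCT'])

Keeping the highest-PCT row per Group then yields the end form shown in the question.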

Upvotes: 1
