J Sedai

Reputation: 115

Finding similar groups based on an intersection of values in a different column

I have a df that looks like this:

Group   Attribute

Cheese  Dairy
Cheese  Food
Cheese  Curd
Cow     Dairy
Cow     Food
Cow     Animal
Cow     Hair
Cow     Stomachs
Yogurt  Dairy
Yogurt  Food
Yogurt  Curd
Yogurt  Fruity

What I'd like to do for each group is find the group that it's most like, based on the intersection of Attributes. The end form I'd like is:

Group   TotalCount   LikeGroup   CommonWords  PCT

Cheese  3            Yogurt      3            100.0
Cow     5            Cheese      2            40.0
Yogurt  4            Cheese      3            75.0

I realize this might be asking a lot in one question. I can do much of it but I'm really lost on getting a count for the intersection of Attributes, even between just one Group and another. If I could find the intersection count between Cheese and Yogurt that would send me in the right direction.

Is it possible to do it within the dataframe? I can see making several lists and doing intersections between all pairs of lists, then using the new list lengths to get the percentages.

For instance, for Yogurt:

>>> Yogurt = ['Dairy', 'Food', 'Curd', 'Fruity']
>>> Cheese = ['Dairy', 'Food', 'Curd']

>>> Yogurt_Cheese = len(set(Yogurt) & set(Cheese)) / len(Yogurt)
>>> Yogurt_Cheese
0.75

>>> Cow = ['Dairy', 'Food', 'Animal', 'Hair', 'Stomachs']

>>> Yogurt_Cow = len(set(Yogurt) & set(Cow)) / len(Yogurt)
>>> Yogurt_Cow
0.5

>>> max(Yogurt_Cheese, Yogurt_Cow)
0.75
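
Put differently, what I'm after is a way to get those lists from the df itself and then generalize the comparison to every pair of groups. A sketch of the first half, assuming the Group/Attribute columns above and Python 3's true division:

>>> attrs = df.groupby('Group')['Attribute'].apply(set)
>>> len(attrs['Yogurt'] & attrs['Cheese']) / len(attrs['Yogurt'])
0.75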

Upvotes: 2

Views: 1090

Answers (2)

Некто

Reputation: 1820

I created my own smaller version of your sample data.

import pandas as pd
from itertools import permutations

df = pd.DataFrame(data=[['cheese', 'dairy'], ['cheese', 'food'], ['cheese', 'curd'],
                        ['cow', 'dairy'], ['cow', 'food'],
                        ['yogurt', 'dairy'], ['yogurt', 'food'], ['yogurt', 'curd'], ['yogurt', 'fruity']],
                  columns=['Group', 'Attribute'])
count_dct = df.groupby('Group')['Attribute'].count().to_dict()  # maps each Group to its TotalCount, used later

unique_grp = df['Group'].unique()      # get the unique groups
unique_atr = df['Attribute'].unique()  # get the unique attributes

combos = list(permutations(unique_grp, 2))  # get all ordered pairs of groups
comp_df = pd.DataFrame(data=combos, columns=['Group', 'LikeGroup'])  # create the frame to put comparison data into
comp_df['CommonWords'] = 0

for atr in unique_atr:
    # restrict the dataframe to rows containing the attribute being looked at during this iteration
    temp_df = df[df['Attribute'] == atr]

    myl = list(permutations(temp_df['Group'], 2))  # the pairs of groups that share this attribute, as tuples
    for comb in myl:
        # increment CommonWords where the Group column equals the first entry
        # of the tuple and the LikeGroup column equals the second
        comp_df.loc[(comp_df['Group'] == comb[0]) & (comp_df['LikeGroup'] == comb[1]), 'CommonWords'] += 1

for key, val in count_dct.items():  # put the previously computed TotalCount into the comparison dataframe
    comp_df.loc[comp_df['Group'] == key, 'TotalCount'] = val

comp_df['PCT'] = (comp_df['CommonWords'] * 100.0 / comp_df['TotalCount']).round()

For my sample data I got this output:

    Group LikeGroup  CommonWords  TotalCount  PCT
0  cheese       cow            2           3   67
1  cheese    yogurt            3           3  100
2     cow    cheese            2           2  100
3     cow    yogurt            2           2  100
4  yogurt    cheese            3           4   75
5  yogurt       cow            2           4   50

which seems to be correct.
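
To reduce this to the one-row-per-Group form the question asks for, a possible final step (my addition, not part of the output above) is to keep the highest-PCT row for each Group:

best = comp_df.loc[comp_df.groupby('Group')['PCT'].idxmax()]  # idxmax keeps the first row on ties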

Upvotes: 4

TMWP

Reputation: 1625

It seems like you should be able to craft an aggregation strategy to crack this. Try looking at these code samples and think about how to construct keys and aggregation functions over your DataFrame, instead of tackling it piecemeal as in your example.

Try running this in your Python environment (it was created in a Jupyter notebook using Python 2.7) and see if it gives you some ideas about your code:

import numpy as np
import pandas as pd

np.random.seed(10)    # optional: makes sure you get the same random
                      # numbers used in the original experiment
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

df
group = df.groupby('key1')
group2 = df.groupby(['key1', 'key2'])
group2.agg(['count', 'sum', 'min', 'max', 'mean', 'std'])
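
Applied to the question's data (a df with the Group/Attribute columns as posted), one way the aggregation idea could play out (a sketch of my own, not the only strategy) is to aggregate each group's attributes into a set and then compare the sets pairwise:

from itertools import permutations

attrs = df.groupby('Group')['Attribute'].apply(set)   # one set of attributes per group

rows = []
for a, b in permutations(attrs.index, 2):             # every ordered pair of groups
    common = len(attrs[a] & attrs[b])                 # size of the attribute intersection
    rows.append((a, len(attrs[a]), b, common, 100.0 * common / len(attrs[a])))

result = pd.DataFrame(rows, columns=['Group', 'TotalCount', 'LikeGroup', 'CommonWords', 'PCT'])

Keeping the highest-PCT row per Group then yields the end form shown in the question.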

Upvotes: 1
