Alpha2020
Alpha2020

Reputation: 83

Pandas groupby and agg values in new column

import pandas as pd    
df = pd.DataFrame({'company' : [ABC, ABC , XYZ, XYZ],
                   'tin': ['5555', '1111', '5555', '2222']                   
                   })

I don't know how to get the column with group by column 'tin' if values is equal from the large dataset.

Desirable result:

df = pd.DataFrame({'company' : [ABC, ABC , XYZ, XYZ],                   
                   'tin': ['5555', '1111', '5555', '2222'],                     
                   'column' : ['text' ABC and XYZ, None,'text' ABC and XYZ, None]

               })

Upvotes: 1

Views: 63

Answers (1)

jezrael
jezrael

Reputation: 862441

I believe you need:

df1 = pd.DataFrame({ 'tin': ['5555', '5555'], 
                   'name' : 'AAA,BBB'.split(',')})

print (df1)
    tin name
0  5555  AAA
1  5555  BBB

df2 = pd.DataFrame({'company' : 'ABC,ABC,XYZ,XYZ,ABC,ABC,XYZ,XYZ'.split(','), 
                   'tin': ['5555', '1111', '5555', '2222', '5555', '1111', '5555', '2222'], 
                   'name' : 'AAA,AAA,AAA,AAA,BBB,BBB,BBB,BBB'.split(',')})

print (df2)
  company   tin name
0     ABC  5555  AAA
1     ABC  1111  AAA
2     XYZ  5555  AAA
3     XYZ  2222  AAA
4     ABC  5555  BBB
5     ABC  1111  BBB
6     XYZ  5555  BBB
7     XYZ  2222  BBB

First use DataFrame.merge for test if match by first DataFrame called df1 with parameter indicator=True and how='left' for left join:

df = df2.merge(df1, on=['tin','name'], how='left', indicator=True)
print (df)
  company   tin name     _merge
0     ABC  5555  AAA       both
1     ABC  1111  AAA  left_only
2     XYZ  5555  AAA       both
3     XYZ  2222  AAA  left_only
4     ABC  5555  BBB       both
5     ABC  1111  BBB  left_only
6     XYZ  5555  BBB       both
7     XYZ  2222  BBB  left_only

Then filter only both rows by boolean indexing:

df = df[df['_merge'].eq('both')]
print (df)
  company   tin name _merge
0     ABC  5555  AAA   both
2     XYZ  5555  AAA   both
4     ABC  5555  BBB   both
6     XYZ  5555  BBB   both

Last aggregate by both columns and assign back by DataFrame.join:

s = df.groupby(['tin','name'])['company'].agg(' and '.join).rename('new')
df = df2.join(s, on=['tin','name'])
print (df)
  company   tin name          new
0     ABC  5555  AAA  ABC and XYZ
1     ABC  1111  AAA          NaN
2     XYZ  5555  AAA  ABC and XYZ
3     XYZ  2222  AAA          NaN
4     ABC  5555  BBB  ABC and XYZ
5     ABC  1111  BBB          NaN
6     XYZ  5555  BBB  ABC and XYZ
7     XYZ  2222  BBB          NaN

Upvotes: 1

Related Questions