Reputation: 83
import pandas as pd
df = pd.DataFrame({'company' : [ABC, ABC , XYZ, XYZ],
'tin': ['5555', '1111', '5555', '2222']
})
I don't know how to get the column with group by column 'tin' if values is equal from the large dataset.
Desirable result:
df = pd.DataFrame({'company' : [ABC, ABC , XYZ, XYZ],
'tin': ['5555', '1111', '5555', '2222'],
'column' : ['text' ABC and XYZ, None,'text' ABC and XYZ, None]
})
Upvotes: 1
Views: 63
Reputation: 862441
I believe you need:
df1 = pd.DataFrame({ 'tin': ['5555', '5555'],
'name' : 'AAA,BBB'.split(',')})
print (df1)
tin name
0 5555 AAA
1 5555 BBB
df2 = pd.DataFrame({'company' : 'ABC,ABC,XYZ,XYZ,ABC,ABC,XYZ,XYZ'.split(','),
'tin': ['5555', '1111', '5555', '2222', '5555', '1111', '5555', '2222'],
'name' : 'AAA,AAA,AAA,AAA,BBB,BBB,BBB,BBB'.split(',')})
print (df2)
company tin name
0 ABC 5555 AAA
1 ABC 1111 AAA
2 XYZ 5555 AAA
3 XYZ 2222 AAA
4 ABC 5555 BBB
5 ABC 1111 BBB
6 XYZ 5555 BBB
7 XYZ 2222 BBB
First use DataFrame.merge
for test if match by first DataFrame called df1
with parameter indicator=True
and how='left' for left join:
df = df2.merge(df1, on=['tin','name'], how='left', indicator=True)
print (df)
company tin name _merge
0 ABC 5555 AAA both
1 ABC 1111 AAA left_only
2 XYZ 5555 AAA both
3 XYZ 2222 AAA left_only
4 ABC 5555 BBB both
5 ABC 1111 BBB left_only
6 XYZ 5555 BBB both
7 XYZ 2222 BBB left_only
Then filter only both rows by boolean indexing
:
df = df[df['_merge'].eq('both')]
print (df)
company tin name _merge
0 ABC 5555 AAA both
2 XYZ 5555 AAA both
4 ABC 5555 BBB both
6 XYZ 5555 BBB both
Last aggregate by both columns and assign back by DataFrame.join
:
s = df.groupby(['tin','name'])['company'].agg(' and '.join).rename('new')
df = df2.join(s, on=['tin','name'])
print (df)
company tin name new
0 ABC 5555 AAA ABC and XYZ
1 ABC 1111 AAA NaN
2 XYZ 5555 AAA ABC and XYZ
3 XYZ 2222 AAA NaN
4 ABC 5555 BBB ABC and XYZ
5 ABC 1111 BBB NaN
6 XYZ 5555 BBB ABC and XYZ
7 XYZ 2222 BBB NaN
Upvotes: 1