Reputation: 1596
I have two data frames with structure as given below.
>>> df1
IID NAME TEXT
0 10 One AA,AB
1 11 Two AB,AC
2 12 Three AB
3 13 Four AC
>>> df2
IID TEXT
0 10 aa
1 10 ab
2 11 abc
3 11 a,c
4 11 ab
5 12 AA
6 13 AC
7 13 ad
8 13 abc
I want them to combine such that new data frame is a copy of df1 with the TEXT field appearing in df2 for the corresponding IID is appended to the TEXT field of df1 with duplicates removed (cases insensitive duplication check).
My expected output is
>>> df1
IID NAME TEXT
0 10 One AA,AB
1 11 Two AB,AC,ABC,A,C
2 12 Three AB,AA
3 13 Four AC,AD,ABC
I tried with groupby on df2, but how can I do the joint of the groupie object to a dataframe ?
Upvotes: 1
Views: 42
Reputation: 13349
I took the reverse steps. First combined the rows having the same values to a list then merge and then combine the two columns into a single column.
df1:
IID NAME TEXT
0 10 One AA,AB
1 11 Two AB,AC
2 12 Three AB
3 13 Four AC
df2:
IID TEXT
0 10 aa
1 10 ab
2 11 abc
3 11 a,c
4 11 ab
5 12 AA
6 13 AC
7 13 ad
8 13 abc
df3 = pd.DataFrame(df2.groupby("IID")['TEXT'].apply(list).transform(lambda x: ','.join(x).upper()).reset_index())
df3:
IID TEXT
0 10 AA,AB
1 11 ABC,A,C,AB
2 12 AA
3 13 AC,AD,ABC
df4 = pd.merge(df1,df3,on='IID')
df4:
IID NAME TEXT_x TEXT_y
0 10 One AA,AB AA,AB
1 11 Two AB,AC ABC,A,C,AB
2 12 Three AB AA
3 13 Four AC AC,AD,ABC
df4['TEXT'] = df4[['TEXT_x','TEXT_y']].apply(
lambda x: ','.join(pd.unique(','.join(x).split(','))),
axis=1
)
df4.drop(['TEXT_x','TEXT_y'],axis=1)
df5 = df1.assign(TEXT = df4.apply(
lambda x: ','.join(pd.unique(','.join(x[['TEXT_x','TEXT_y']]).split(','))),
axis=1))
df4/df5:
IID NAME TEXT
0 10 One AA,AB
1 11 Two AB,AC,ABC,A,C
2 12 Three AB,AA
3 13 Four AC,AD,ABC
Upvotes: 1
Reputation: 75080
I believe you need concat
with groupby.agg
to create the skeleton with duplicates , then series.explode
with groupby+unique
for de-duplicating
out = (pd.concat((df1,df2),sort=False).groupby('IID')
.agg({'NAME':'first','TEXT':','.join}).reset_index())
out['TEXT'] = (out['TEXT'].str.upper().str.split(',').explode()
.groupby(level=0).unique().str.join(','))
print(out)
IID NAME TEXT
0 10 One AA,AB
1 11 Two AB,AC,ABC,A,C
2 12 Three AB,AA
3 13 Four AC,AD,ABC
Upvotes: 1