Shew
Shew

Reputation: 1596

pandas combine a data frame with another groupby dataframe

I have two data frames with structure as given below.

>>> df1
   IID   NAME   TEXT
0   10    One  AA,AB
1   11    Two  AB,AC
2   12  Three     AB
3   13   Four     AC
>>> df2
   IID TEXT
0   10   aa
1   10   ab
2   11  abc
3   11  a,c
4   11   ab
5   12   AA
6   13   AC
7   13   ad
8   13  abc

I want them to combine such that new data frame is a copy of df1 with the TEXT field appearing in df2 for the corresponding IID is appended to the TEXT field of df1 with duplicates removed (cases insensitive duplication check).

My expected output is

>>> df1
   IID   NAME           TEXT
0   10    One          AA,AB
1   11    Two  AB,AC,ABC,A,C
2   12  Three          AB,AA
3   13   Four      AC,AD,ABC

I tried with groupby on df2, but how can I do the joint of the groupie object to a dataframe ?

Upvotes: 1

Views: 42

Answers (2)

Pygirl
Pygirl

Reputation: 13349

I took the reverse steps. First combined the rows having the same values to a list then merge and then combine the two columns into a single column.

df1:

    IID NAME    TEXT
0   10  One     AA,AB
1   11  Two     AB,AC
2   12  Three   AB
3   13  Four    AC

df2:

    IID TEXT
0   10  aa
1   10  ab
2   11  abc
3   11  a,c
4   11  ab
5   12  AA
6   13  AC
7   13  ad
8   13  abc

df3 = pd.DataFrame(df2.groupby("IID")['TEXT'].apply(list).transform(lambda x: ','.join(x).upper()).reset_index())

df3:

IID TEXT
0   10  AA,AB
1   11  ABC,A,C,AB
2   12  AA
3   13  AC,AD,ABC

df4 = pd.merge(df1,df3,on='IID')

df4:

    IID NAME    TEXT_x  TEXT_y
0   10  One     AA,AB   AA,AB
1   11  Two     AB,AC   ABC,A,C,AB
2   12  Three   AB  AA
3   13  Four    AC  AC,AD,ABC

df4['TEXT'] = df4[['TEXT_x','TEXT_y']].apply(
    lambda x: ','.join(pd.unique(','.join(x).split(','))),
    axis=1
)
df4.drop(['TEXT_x','TEXT_y'],axis=1)

OR

df5 = df1.assign(TEXT = df4.apply(
    lambda x: ','.join(pd.unique(','.join(x[['TEXT_x','TEXT_y']]).split(','))),
    axis=1))

df4/df5:

    IID NAME    TEXT
0   10  One     AA,AB
1   11  Two     AB,AC,ABC,A,C
2   12  Three   AB,AA
3   13  Four    AC,AD,ABC

Upvotes: 1

anky
anky

Reputation: 75080

I believe you need concat with groupby.agg to create the skeleton with duplicates , then series.explode with groupby+unique for de-duplicating

out = (pd.concat((df1,df2),sort=False).groupby('IID')
      .agg({'NAME':'first','TEXT':','.join}).reset_index())
out['TEXT'] = (out['TEXT'].str.upper().str.split(',').explode()
              .groupby(level=0).unique().str.join(','))
print(out)

   IID   NAME           TEXT
0   10    One          AA,AB
1   11    Two  AB,AC,ABC,A,C
2   12  Three          AB,AA
3   13   Four      AC,AD,ABC

Upvotes: 1

Related Questions