Leon Rai
Leon Rai

Reputation: 1631

Union of two pandas DataFrames

Say I have two data frames:

df1:

  A
0 a
1 b

df2:

  A
0 a
1 c

I want the result to be the union of the two frames with an extra column showing the source data frame that the row belongs to. In case of duplicates, duplicates should be removed and the respective extra column should show both sources:

  A  B
0 a  df1, df2
1 b  df1
2 c  df2

I can get the concatenated data frame (df3) without duplicates as follows:

import pandas as pd
df3=pd.concat([df1,df2],ignore_index=True).drop_duplicates().reset_index(drop=True)

I can't think of/find a method to have control over what element goes where. How can I add the extra column?

Thank you very much for any tips.

Upvotes: 14

Views: 14354

Answers (3)

Narges Ayoubi
Narges Ayoubi

Reputation: 66

Use the command below:

df3 = pd.concat([df1.assign(source='df1'), df2.assign(source='df2')]) \
    .groupby('A') \
    .aggregate(list) \
    .reset_index()

The result will be:

   A      source
0  a  [df1, df2]
1  b       [df1]
2  c       [df2]

The assign will add a column named source with value df1 and df2 to your dataframes. groupby command groups rows with same A value to single row. aggregate command describes how to aggregate other columns (source) for each group of rows with same A. I have used list aggregate function so that the source column be the list of values with same A.

Upvotes: 3

cph_sto
cph_sto

Reputation: 7585

We use outer join to solve this -

df1 = pd.DataFrame({'A':['a','b']})
df2 = pd.DataFrame({'A':['a','c']})
df1['col1']='df1'
df2['col2']='df2'
df=pd.merge(df1, df2, on=['A'], how="outer").fillna('')
df['B']=df['col1']+','+df['col2']
df['B'] = df['B'].str.strip(',')
df=df[['A','B']]
df

   A        B
0  a  df1,df2
1  b      df1
2  c      df2

Upvotes: 2

cs95
cs95

Reputation: 402493

Merge with an indicator argument, and remap the result:

m = {'left_only': 'df1', 'right_only': 'df2', 'both': 'df1, df2'}

result = df1.merge(df2, on=['A'], how='outer', indicator='B')
result['B'] = result['B'].map(m)

result
   A         B
0  a  df1, df2
1  b       df1
2  c       df2

Upvotes: 15

Related Questions