Thiago Melo
Thiago Melo

Reputation: 1187

Pandas Merge DataFrames without rows overlap

I have two dataframes like these:

enter image description here

enter image description here

They have the same columns.

Since I am broadcasting an API, they usually hava some overlap, which can be handled by the tradeID which is unique.

I have tried some stuff like:

df2 = df0.join(df1, how='outer', lsuffix='_caller', rsuffix='_other')

and

df2 = df0.merge(df1, left_index=True, right_index=True)

But the results are respectively:

enter image description here

andenter image description here

I am looking for a union without overlap, could someone help me?

Upvotes: 2

Views: 4182

Answers (2)

elPastor
elPastor

Reputation: 8956

Seems like combine_first() should do it for you:

df2 = df0.combine_first(df1)

...where df0 takes precedence over df1 when the indicies match. Although in your case, if they're identical, it doesn't really matter. But if they're not identical, that's how combine_first() works.

The following is an example of it working with dummy data.

Code:

import pandas as pd
import io

a = io.StringIO(u'''
tradeID,amount,date
X001,100,1/1/2016
X002,200,1/2/2016
X003,300,1/3/2016
X005,500,1/5/2016
''')

b = io.StringIO(u'''
tradeID,amount,date
X004,400,1/4/2016
X005,500,1/5/2016
X006,600,1/6/2016
''')

dfA = pd.read_csv(a, index_col = 'tradeID')
dfB = pd.read_csv(b, index_col = 'tradeID')

df = dfA.combine_first(dfB)

Output:

         amount      date
tradeID                  
X001      100.0  1/1/2016
X002      200.0  1/2/2016
X003      300.0  1/3/2016
X004      400.0  1/4/2016
X005      500.0  1/5/2016
X006      600.0  1/6/2016

If you really want to use merge you can still do that, but you'll need to add some syntax to keep your indicies (more info):

df = dfA.reset_index().merge(dfB.reset_index(), how = 'outer').set_index('tradeID')

I ran super rudimentary timing on these two options and combine_first() consistently beat merge by nearly 3x on this very small data set.

...and Igor Raush's version tested at or slightly faster than combine_first().

Upvotes: 6

Igor Raush
Igor Raush

Reputation: 15240

One way to accomplish this is

pd.concat([df0, df1]).loc[lambda df: ~df.index.duplicated()]

Upvotes: 1

Related Questions