Reputation: 1187
I have two dataframes like these:
They have the same columns.
Since I am broadcasting an API, they usually have some overlap, which can be handled using the tradeID, which is unique.
I have tried some stuff like:
df2 = df0.join(df1, how='outer', lsuffix='_caller', rsuffix='_other')
and
df2 = df0.merge(df1, left_index=True, right_index=True)
But the results are respectively:
I am looking for a union without overlap. Could someone help me?
Upvotes: 2
Views: 4182
Reputation: 8956
Seems like combine_first() should do it for you:
df2 = df0.combine_first(df1)
...where df0 takes precedence over df1 when the indices match. In your case, if the overlapping rows are identical, it doesn't really matter. But if they're not identical, that's how combine_first() works.
The following is an example of it working with dummy data.
Code:
import pandas as pd
import io
a = io.StringIO(u'''
tradeID,amount,date
X001,100,1/1/2016
X002,200,1/2/2016
X003,300,1/3/2016
X005,500,1/5/2016
''')
b = io.StringIO(u'''
tradeID,amount,date
X004,400,1/4/2016
X005,500,1/5/2016
X006,600,1/6/2016
''')
dfA = pd.read_csv(a, index_col = 'tradeID')
dfB = pd.read_csv(b, index_col = 'tradeID')
df = dfA.combine_first(dfB)
Output:
         amount      date
tradeID
X001      100.0  1/1/2016
X002      200.0  1/2/2016
X003      300.0  1/3/2016
X004      400.0  1/4/2016
X005      500.0  1/5/2016
X006      600.0  1/6/2016
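In the dummy data above the overlapping row (X005) is identical in both frames, so the precedence rule is not visible. The following is a minimal sketch, using made-up frames named df_first and df_second (not from the question), in which the shared tradeID carries different amounts. It suggests that the calling frame's value is the one that survives:

```python
import pandas as pd

# Hypothetical data: X005 appears in both frames with different amounts.
df_first = pd.DataFrame(
    {"amount": [100, 500]},
    index=pd.Index(["X001", "X005"], name="tradeID"),
)
df_second = pd.DataFrame(
    {"amount": [999, 600]},
    index=pd.Index(["X005", "X006"], name="tradeID"),
)

# The caller (df_first) wins wherever both frames have a non-null value;
# rows found only in df_second are added to the result.
combined = df_first.combine_first(df_second)
print(combined)
```

Swapping the two frames (df_second.combine_first(df_first)) should keep 999 for X005 instead, so the order of the call matters whenever the overlap is not identical.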
If you really want to use merge, you can still do that, but you'll need to add some syntax to keep your indices:
df = dfA.reset_index().merge(dfB.reset_index(), how = 'outer').set_index('tradeID')
I ran some very rudimentary timing on these two options, and combine_first() consistently beat merge by nearly 3x on this very small data set.
...and Igor Raush's version tested at or slightly faster than combine_first().
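For anyone who wants to repeat the comparison, here is a rough sketch of how such a timing could be set up with timeit. The helper function names and the repetition count are my own choices for illustration, not the harness originally used, and the numbers will vary by machine, pandas version, and data size:

```python
import timeit

import pandas as pd

# Same dummy data as above, built directly instead of via read_csv.
dfA = pd.DataFrame(
    {"amount": [100, 200, 300, 500],
     "date": ["1/1/2016", "1/2/2016", "1/3/2016", "1/5/2016"]},
    index=pd.Index(["X001", "X002", "X003", "X005"], name="tradeID"),
)
dfB = pd.DataFrame(
    {"amount": [400, 500, 600],
     "date": ["1/4/2016", "1/5/2016", "1/6/2016"]},
    index=pd.Index(["X004", "X005", "X006"], name="tradeID"),
)


def via_combine_first():
    return dfA.combine_first(dfB)


def via_merge():
    # Outer merge on all shared columns, then restore the tradeID index.
    return dfA.reset_index().merge(dfB.reset_index(), how="outer").set_index("tradeID")


def via_concat():
    # Stack both frames and keep the first occurrence of each tradeID.
    both = pd.concat([dfA, dfB])
    return both.loc[~both.index.duplicated()]


for fn in (via_combine_first, via_merge, via_concat):
    seconds = timeit.timeit(fn, number=200)
    print(f"{fn.__name__}: {seconds:.4f} s for 200 runs")
```

Note that the three results are not guaranteed to come back in the same row order, so sort by index before comparing them.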
Upvotes: 6
Reputation: 15240
One way to accomplish this is
pd.concat([df0, df1]).loc[lambda df: ~df.index.duplicated()]
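As a quick illustration of the one-liner, here is a small sketch on made-up frames (the data is hypothetical; the names df0 and df1 follow the question). duplicated() defaults to keep='first', so rows from the first frame passed to concat are the ones kept when a tradeID appears in both:

```python
import pandas as pd

# Hypothetical data: X005 is present in both frames with different amounts.
df0 = pd.DataFrame(
    {"amount": [100, 500]},
    index=pd.Index(["X001", "X005"], name="tradeID"),
)
df1 = pd.DataFrame(
    {"amount": [999, 600]},
    index=pd.Index(["X005", "X006"], name="tradeID"),
)

# Stack the frames, then drop every repeated index label after its first occurrence.
union = pd.concat([df0, df1]).loc[lambda df: ~df.index.duplicated(keep="first")]

# concat preserves input order, so sort explicitly if index order matters.
union_sorted = union.sort_index()
print(union_sorted)
```

Unlike combine_first(), this keeps or drops whole rows rather than filling individual missing cells, which is usually what you want when the overlapping rows are exact duplicates.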
Upvotes: 1