Reputation: 43
I've a large dataframe as shown below:
df1:
Date Code ab-ret
0 1997-07-02 11 NaN
1 1997-07-04 11 NaN
2 1997-07-07 11 NaN
3 1997-07-08 11 NaN
4 1997-07-10 11 NaN
... ... ... ...
377395 2017-12-22 5757 -0.046651
377396 2017-12-26 5757 -0.017728
377397 2017-12-27 5757 0.024860
377398 2017-12-28 5757 0.016094
377399 2017-12-29 5757 -0.052789
377400 rows × 3 columns
I've a smaller dataframe as shown below:
df2:
Date Code
0 2009-03-17 11
1 2010-02-03 11
2 2011-02-14 363
3 2015-01-09 363
4 2010-10-15 365
... ... ...
9516 2015-02-24 449479
9517 2015-09-01 449479
9518 2016-04-01 449479
9519 2013-06-21 452095
9520 2015-05-06 553720
[9521 rows x 2 columns]
I want to compare columns 'Date'
and 'Code'
of each dataframe and whether a row in df1
has the same 'Date'
and 'Code'
as in a given row of df2
simultaneously. Based on that, I want to create a new column in df1
which states 'True'
if the above mentioned condition is satisfied and 'false'
if not satisfied. How can it be done fast (not using loops is preferred as it takes a lot of time)?
P.S. All elements in a row from df2.Date
and df2.Code
aren't guaranteed to be in a given row of df1.Date
and df1.Code.
Also, I want all the rows in df1
to remain( only looking to add a new column in df1
stating if the corresponding 'Date'
and 'Code'
is present in df2
or not). Hence, I'm not looking to merge or do an inner join.
Thus, I want the desired output as:
Date Code ab-ret Match
0 1997-07-02 11 NaN False
1 1997-07-04 11 NaN False
2 1997-07-07 11 NaN False
3 1997-07-08 11 NaN False
4 1997-07-10 11 NaN False
... ... ... ...
377395 2017-12-22 5757 -0.046651 True
377396 2017-12-26 5757 -0.017728 True
377397 2017-12-27 5757 0.024860 True
377398 2017-12-28 5757 0.016094 False
377399 2017-12-29 5757 -0.052789 True
377400 rows × 4 columns
Upvotes: 0
Views: 467
Reputation: 29635
It is a merge
operation, use the parameter indicator=True
to get a column named '_merge' close to the column 'Match' you want to create. Then just need to convert this column to False/True like in your expected output and drop
the _merge column.
df1 = (df1.merge(df2, how='left', indicator=True)
.assign(Match=lambda x: x['_merge'].eq('both'))
.drop('_merge', axis=1)
)
Upvotes: 1
Reputation: 6483
IIUC, you could try also a tuple comparison by pd.DataFrame.set_index()
and using pd.DataFrame.isin
:
df1.set_index(['Date','Code']).index.isin(df2.set_index(['Date','Code']).index.to_list())
Example:
d={'Date': {0: pd.Timestamp('1997-07-02 00:00:00'), 1: pd.Timestamp('1997-07-04 00:00:00'), 2: pd.Timestamp('1997-07-07 00:00:00')},
'Code': {0: 11, 1: 13, 2: 14}, 'ab-ret': {0: np.nan, 1: np.nan, 2: np.nan}}
df1=pd.DataFrame(d)
df1
# Date Code ab-ret
#0 1997-07-02 11 NaN
#1 1997-07-04 13 NaN
#2 1997-07-07 14 NaN
d={'Date': {0: pd.Timestamp('1997-07-02 00:00:00'), 1: pd.Timestamp('1997-07-04 00:00:00')},
'Code': {0: 11, 1: 11}, 'ab-ret': {0: np.nan, 1: np.nan}}
df2=pd.DataFrame(d)
df2
# Date Code ab-ret
#0 1997-07-02 11 NaN
#1 1997-07-04 11 NaN
df1['Match']=df1.set_index(['Date','Code']).index.isin(df2.set_index(['Date','Code']).index.to_list())
df1
# Date Code ab-ret Match
#0 1997-07-02 11 NaN True
#1 1997-07-04 13 NaN False
#2 1997-07-07 14 NaN False
Upvotes: 1