Reputation: 129
I have two dataframes, as:
df1 = pd.DataFrame({'Id' : [A, B, A, A, C],
'date' : [01-01-15, 31-01-15, 03-02-15, 03-02-15, 15-03-16],
'attr':[1, 5, 3, 5, 6]
}, columns=['Id','date', 'attr'])
df2 = pd.DataFrame({'Id' : [A, B, A, A, B],
'date' : [01-02-15, 01-01-15, 02-02-15, 03-12-15, 15-03-16],
'counter':[1, 1, 2, 3, 2]
}, columns=['Id','date', 'counter'])
The first one contains an Id, a date and some attributes. The second one contains an Id, a date and other attributes (counters increasing as the time passes). Note that all attributes appear multiple times.
Now, I want to "join" the two based on the Id, but considering only the closest date in the past for df2.
In practice, my expected output would be:
Id | date | att | counter
A | 01-01-15 | 1 | Nan
B | 31-01-15 | 5 | 1
A | 03-02-15 | 3 | 2
A | 03-02-15 | 5 | 2
C | 15-03-16 | 6 | NaN
Note that the NaN could be replaced by any other reasonable value, 0 included.
The first df contains about 300K rows, while the second one about 3 millions.
Upvotes: 1
Views: 170
Reputation: 294546
You need pd.merge_asof
pd.merge_asof(df1, df2.sort_values('date'), on='date', by='Id')
Id date attr counter
0 A 2015-01-01 1 NaN
1 B 2015-01-31 5 1.0
2 A 2015-03-02 3 2.0
3 A 2015-03-02 5 2.0
4 C 2016-03-15 6 NaN
The default value for the direction
argument is 'backward'
From Docs
Both DataFrames must be sorted by the key.
For each row in the left DataFrame:
- A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.
- A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.
- A “nearest” search selects the row in the right DataFrame whose ‘on’ key is closest in absolute distance to the left’s key.
Upvotes: 2