Reputation: 610
I have 2 dataframes that I'm trying to join using merge_asof function. The first dataframe contains one row of an component that failed.
>>> import pandas as pd
>>> from datetime import datetime
>>> failed_date_str = '2019-05-09'
>>> failed_date = datetime.strptime(failed_date_str, '%Y-%m-%d')
>>> failed = pd.DataFrame([{"component": "a1", "failed_date": failed_date}])
>>> failed
component failed_date
0 a1 2019-05-09
The second dataframe contains a list of tickets/reasons against that component.
>>> future_failed_date = failed_date + pd.Timedelta(weeks=1)
>>> cases = pd.DataFrame(
... [
... {
.. "component": "a1",
... "tickets": 'r'+str(o),
... "created_date": future_failed_date - pd.Timedelta(days=o)
... }
... for o in range(0, 10) if o!=7
... ]
... ).sort_values(by="created_date")
>>> cases
component created_date tickets
8 a1 2019-05-07 r9
7 a1 2019-05-08 r8
6 a1 2019-05-10 r6
5 a1 2019-05-11 r5
4 a1 2019-05-12 r4
3 a1 2019-05-13 r3
2 a1 2019-05-14 r2
1 a1 2019-05-15 r1
0 a1 2019-05-16 r0
Note that the component a1 failed on 2015-05-09 and there were several tickets created before and after the failed_date.
I would like to get the closest set of tickets +/- within 2 days of the failed date. I try the following:
>>> pd.merge_asof(
... failed,
... cases,
... by="component",
... left_on="failed_date",
... right_on="created_date",
... direction="nearest",
... tolerance=pd.Timedelta(days=2)
... )
component failed_date created_date tickets
0 a1 2019-05-09 2019-05-08 r8
Since, I have specified the direction as nearest and tolerance as 2 days, I'm hoping that I would get 4 rows corresponding to dates (2019-05-07 to 2019-05-11) but I only get one row as shown above. Looks like I don't understand the usage of direction/tolerance parameter in merge_asof. Can somebody explain what I'm doing wrong?
If it helps, my version of pandas and python are as follows:
>>> pd.__version__
'0.24.2'
>>> import sys
>>> sys.version_info
sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)
Upvotes: 1
Views: 4375
Reputation: 153510
@WenYoBen has the correct answer. "Order matters"
This is another way.
You can try something like this merge on 'component' then filter:
failed.merge(cases, on='component').loc[lambda x: (x['created_date'] - x['failed_date']).dt.days.abs() <= 2]
Output:
component failed_date tickets created_date
0 a1 2019-05-09 r9 2019-05-07
1 a1 2019-05-09 r8 2019-05-08
2 a1 2019-05-09 r6 2019-05-10
3 a1 2019-05-09 r5 2019-05-11
Upvotes: 2
Reputation: 323366
The order of your df is important for merge_asof
, it will do left merge on default :
This is similar to a left-join except that we match on nearest key rather than equal keys
pd.merge_asof(
cases,
failed,
by="component",
right_on="failed_date",
left_on="created_date",
direction='nearest',
tolerance=pd.Timedelta(days=2)
)
Out[297]:
component tickets created_date failed_date
0 a1 r9 2019-05-07 2019-05-09
1 a1 r8 2019-05-08 2019-05-09
2 a1 r6 2019-05-10 2019-05-09
3 a1 r5 2019-05-11 2019-05-09
4 a1 r4 2019-05-12 NaT
5 a1 r3 2019-05-13 NaT
6 a1 r2 2019-05-14 NaT
7 a1 r1 2019-05-15 NaT
8 a1 r0 2019-05-16 NaT
Upvotes: 2