Bhushan
Bhushan

Reputation: 610

Pandas merge_asof with tolerance parameter fails

I have 2 dataframes that I'm trying to join using merge_asof function. The first dataframe contains one row of an component that failed.

>>> import pandas as pd
>>> from datetime import datetime

>>> failed_date_str = '2019-05-09'
>>> failed_date = datetime.strptime(failed_date_str, '%Y-%m-%d')
>>> failed = pd.DataFrame([{"component": "a1", "failed_date": failed_date}])
>>> failed
  component failed_date
0        a1  2019-05-09

The second dataframe contains a list of tickets/reasons against that component.

>>> future_failed_date = failed_date + pd.Timedelta(weeks=1)
>>> cases = pd.DataFrame(
...     [
...         {
..               "component": "a1",
...              "tickets": 'r'+str(o),
...              "created_date": future_failed_date - pd.Timedelta(days=o)
...         }
...         for o in range(0, 10) if o!=7
...     ]
... ).sort_values(by="created_date")
>>> cases
  component created_date tickets
8        a1   2019-05-07      r9
7        a1   2019-05-08      r8
6        a1   2019-05-10      r6
5        a1   2019-05-11      r5
4        a1   2019-05-12      r4
3        a1   2019-05-13      r3
2        a1   2019-05-14      r2
1        a1   2019-05-15      r1
0        a1   2019-05-16      r0

Note that the component a1 failed on 2015-05-09 and there were several tickets created before and after the failed_date.

I would like to get the closest set of tickets +/- within 2 days of the failed date. I try the following:

>>> pd.merge_asof(
...     failed,
...     cases,
...     by="component",
...     left_on="failed_date",
...     right_on="created_date",
...     direction="nearest",
...     tolerance=pd.Timedelta(days=2)
... )
  component failed_date created_date tickets
0        a1  2019-05-09   2019-05-08      r8

Since, I have specified the direction as nearest and tolerance as 2 days, I'm hoping that I would get 4 rows corresponding to dates (2019-05-07 to 2019-05-11) but I only get one row as shown above. Looks like I don't understand the usage of direction/tolerance parameter in merge_asof. Can somebody explain what I'm doing wrong?

If it helps, my version of pandas and python are as follows:

>>> pd.__version__
'0.24.2'
>>> import sys
>>> sys.version_info
sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)

Upvotes: 1

Views: 4375

Answers (2)

Scott Boston
Scott Boston

Reputation: 153510

@WenYoBen has the correct answer. "Order matters"

This is another way.

You can try something like this merge on 'component' then filter:

failed.merge(cases, on='component').loc[lambda x: (x['created_date'] - x['failed_date']).dt.days.abs() <= 2]

Output:

  component failed_date tickets created_date
0        a1  2019-05-09      r9   2019-05-07
1        a1  2019-05-09      r8   2019-05-08
2        a1  2019-05-09      r6   2019-05-10
3        a1  2019-05-09      r5   2019-05-11

Upvotes: 2

BENY
BENY

Reputation: 323366

The order of your df is important for merge_asof, it will do left merge on default :

This is similar to a left-join except that we match on nearest key rather than equal keys

pd.merge_asof(
    cases,
    failed,
    by="component",
    right_on="failed_date",
    left_on="created_date",
    direction='nearest',
    tolerance=pd.Timedelta(days=2)
)
Out[297]: 
  component tickets created_date failed_date
0        a1      r9   2019-05-07  2019-05-09
1        a1      r8   2019-05-08  2019-05-09
2        a1      r6   2019-05-10  2019-05-09
3        a1      r5   2019-05-11  2019-05-09
4        a1      r4   2019-05-12         NaT
5        a1      r3   2019-05-13         NaT
6        a1      r2   2019-05-14         NaT
7        a1      r1   2019-05-15         NaT
8        a1      r0   2019-05-16         NaT

Upvotes: 2

Related Questions