ImAUser
ImAUser

Reputation: 129

Join with constraint on dates

I have two dataframes, as:

df1 = pd.DataFrame({'Id' : [A, B, A, A, C],
 'date' : [01-01-15, 31-01-15, 03-02-15, 03-02-15, 15-03-16],
 'attr':[1, 5, 3, 5, 6]
  }, columns=['Id','date', 'attr'])

df2 = pd.DataFrame({'Id' : [A, B, A, A, B],
 'date' : [01-02-15, 01-01-15, 02-02-15, 03-12-15, 15-03-16],
 'counter':[1, 1, 2, 3, 2]
  }, columns=['Id','date', 'counter'])

The first one contains an Id, a date and some attributes. The second one contains an Id, a date and other attributes (counters increasing as the time passes). Note that all attributes appear multiple times.

Now, I want to "join" the two based on the Id, but considering only the closest date in the past for df2.

In practice, my expected output would be:

Id | date     | att | counter
A  | 01-01-15 | 1   | Nan
B  | 31-01-15 | 5   | 1 
A  | 03-02-15 | 3   | 2
A  | 03-02-15 | 5   | 2
C  | 15-03-16 | 6   | NaN

Note that the NaN could be replaced by any other reasonable value, 0 included.

The first df contains about 300K rows, while the second one about 3 millions.

Upvotes: 1

Views: 170

Answers (1)

piRSquared
piRSquared

Reputation: 294546

You need pd.merge_asof

pd.merge_asof(df1, df2.sort_values('date'), on='date', by='Id')

  Id       date  attr  counter
0  A 2015-01-01     1      NaN
1  B 2015-01-31     5      1.0
2  A 2015-03-02     3      2.0
3  A 2015-03-02     5      2.0
4  C 2016-03-15     6      NaN

The default value for the direction argument is 'backward' From Docs

Both DataFrames must be sorted by the key.

For each row in the left DataFrame:

  • A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.
  • A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.
  • A “nearest” search selects the row in the right DataFrame whose ‘on’ key is closest in absolute distance to the left’s key.

Upvotes: 2

Related Questions