Merge two dataframes for a date in date range with conditions

I have calendar dataframe as follows.

calendar = pd.DataFrame({"events": ["e1", "e2", "e3"],
                         "date_start": ["2021-02-01", "2021-02-06", "2021-02-03"],
                         "date_end":["2021-02-04", "2021-02-07", "2021-02-03"],
                         "country": ["us", "us", "uk"]})
calendar["date_start"] = pd.to_datetime(calendar["date_start"])
calendar["date_end"] = pd.to_datetime(calendar["date_end"])

and I have a daily dataframe as follows.

daily = pd.DataFrame({"date": pd.date_range(start="2021-02-01", end="2021-02-08"),
                      "value":[10, 20, 30, 40, 50, 60, 70, 80]})

I would like to take only events from US and join to the daily dataframe but the joining conditions are (date >= date_start) and (date <= date_end). So the expected output looks like this

date              value   events
2021-02-01        10      e1
2021-02-02        20      e1
2021-02-03        30      e1
2021-02-04        40      e1
2021-02-05        50
2021-02-06        60      e2
2021-02-07        70      e2
2021-02-08        80

I can do looping but it is not effective. May I have your suggestions how to do in the better way.

Upvotes: 6

Views: 598

Answers (4)

sammywemmy
sammywemmy

Reputation: 28659

One option for a non-equi join is the conditional_join from pyjanitor; underneath the hood it uses binary search to avoid a cartesian product; this can be helpful, depending on the data size:

# pip install pyjanitor
import janitor
import pandas as pd
(
daily
.conditional_join(
    calendar, 
   ("date", "date_start", ">="), 
   ("date", "date_end", "<="), 
    how="left")
.loc[:, ['date', 'value', 'events']]
)


        date  value events
0 2021-02-01     10     e1
1 2021-02-02     20     e1
2 2021-02-03     30     e1
3 2021-02-03     30     e3
4 2021-02-04     40     e1
5 2021-02-05     50    NaN
6 2021-02-06     60     e2
7 2021-02-07     70     e2
8 2021-02-08     80    NaN

Upvotes: 0

Deepak
Deepak

Reputation: 470

Here is a possible answer to your question.

import numpy as np
import pandas as pd
data_temp_1 = pd.merge(daily,calendar,how='cross')
data_temp_2 = data_temp_1.query('country=="us"')
indices = np.where((data_temp_2['date'] >= data_temp_2['date_start']) & (data_temp_2['date'] <= data_temp_2['date_end']),True,False)
final_df = data_temp_2[indices]
final_df.reset_index(drop=True,inplace=True)

final_output

To get the expected df we can use code

  • expected_df = pd.merge(daily,final_df,how='left')[['date','value','events']] expected output

Upvotes: 1

Guillaume Ansanay-Alex
Guillaume Ansanay-Alex

Reputation: 1252

You can first explode the calendar and then merge on days:

calendar['date'] = [pd.date_range(s, e, freq='d') for s, e in 
   zip(calendar['date_start'], calendar['date_end'])]
calendar = calendar.explode('date').drop(['date_start', 'date_end'], axis=1)
events = calendar.merge(daily, how='inner', on='date')
us_events = events[events.country == 'us'].drop('country', axis=1)[['date', 'value', 'events']]

I think it is faster than the other answers provided (no apply).

Upvotes: 0

Mayank Porwal
Mayank Porwal

Reputation: 34046

Use df.merge:

# Do a cross-join on the `tmp` column
In [2279]: x = calendar.assign(tmp=1).merge(daily.assign(tmp=1))

# Filter rows by providing your conditions
In [2284]: x = x[x.date.between(x.date_start, x.date_end) & x.country.eq('us')]

# Left-join with `daily` df to get all rows
In [2289]: ans = daily.merge(x[['date', 'events']], on='date', how='left')

In [2290]: ans
Out[2290]: 
        date  value events
0 2021-02-01     10     e1
1 2021-02-02     20     e1
2 2021-02-03     30     e1
3 2021-02-04     40     e1
4 2021-02-05     50    NaN
5 2021-02-06     60     e2
6 2021-02-07     70     e2
7 2021-02-08     80    NaN

Upvotes: 5

Related Questions