Jordan
Jordan

Reputation: 1495

How to join two pandas dataframes based on a date in df1 being >= date in df2

I have a large data frame with key IDs, states, start dates and other characteristics. I have another data frame with states, a start date and a "1" to signify a flag.

I want to join the two, based on the state and the date in df1 being greater than or equal to the date in df2.

Take the example below. df1 is the table of states, start dates, and a 1 for a flag. df2 is a dataframe that needs those flags if the date in df2 is >= the date in df1. The end result is df3. The only observations get the flag whose states match and dates are >= the original dates.

import pandas as pd

dict1 = {'date':['2020-01-01', '2020-02-15', '2020-02-04','2020-03-17',
                 '2020-06-15'],
         'state':['AL','FL','MD','NC','SC'],
         'flag': [1,1,1,1,1]}
df1 = pd.DataFrame(dict1)
df1['date'] = pd.to_datetime(df1['date'])

dict2 = {'state': ['AL','FL','MD','NC','SC'],
         'keyid': ['001','002','003','004','005'],
         'start_date':['2020-02-01', '2020-01-15', '2020-01-30','2020-05-18',
                       '2020-05-16']}
df2 = pd.DataFrame(dict2)
df2['start_date'] = pd.to_datetime(df2['start_date'])

df3 = df2
df3['flag'] = [0,1,1,0,1]

How do I get to df3 programmatically? My actual df1 has a row for each state. My actual df2 has over a million observations with different dates.

Upvotes: 1

Views: 139

Answers (2)

jezrael
jezrael

Reputation: 862441

Use merge_asof for merge by greater or equal datetimes by parameter direction='forward':

A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.

df2['need'] = [0,1,1,0,1]

df1 = df1.sort_values('date')
df2 = df2.sort_values('start_date')

df = pd.merge_asof(df2, 
                   df1, 
                   left_on='start_date', 
                   right_on='date', 
                   by='state',
                   direction='forward')


df['flag'] = df['flag'].fillna(0).astype(int)
print (df)

  state keyid start_date  need       date  flag
0    FL   002 2020-01-15     1 2020-02-15     1
1    MD   003 2020-01-30     1 2020-02-04     1
2    AL   001 2020-02-01     0        NaT     0
3    SC   005 2020-05-16     1 2020-06-15     1
4    NC   004 2020-05-18     0        NaT     0

You can also rename column for avoid appending in output DataFrame:

df2['need'] = [0,1,1,0,1]

df1 = df1.sort_values('date')
df2 = df2.sort_values('start_date')

df = pd.merge_asof(df2, 
                   df1.rename(columns={'date':'start_date'}), 
                   on='start_date', 
                   by='state',
                   direction='forward')

df['flag'] = df['flag'].fillna(0).astype(int)
print (df)
  state keyid start_date  need  flag
0    FL   002 2020-01-15     1     1
1    MD   003 2020-01-30     1     1
2    AL   001 2020-02-01     0     0
3    SC   005 2020-05-16     1     1
4    NC   004 2020-05-18     0     0

Upvotes: 1

Mayank Porwal
Mayank Porwal

Reputation: 34046

Use df.merge and numpy.where:

In [29]: import numpy as np

In [30]: df3 = df2.merge(df1)[['state', 'keyid', 'start_date', 'date']]

In [31]: df3['flag'] = np.where(df3['start_date'].ge(df3['date']), 0, 1)

In [33]: df3.drop('date', 1, inplace=True)

In [34]: df3
Out[34]: 
  state keyid start_date  flag
0    AL   001 2020-02-01     0
1    FL   002 2020-01-15     1
2    MD   003 2020-01-30     1
3    NC   004 2020-05-18     0
4    SC   005 2020-05-16     1

Upvotes: 1

Related Questions