Reputation: 1495
I have a large data frame with key IDs, states, start dates and other characteristics. I have another data frame with states, a start date and a "1" to signify a flag.
I want to join the two, based on the state and the date in df1 being greater than or equal to the date in df2.
Take the example below. df1
is the table of states, start dates, and a 1 for a flag. df2
is a dataframe that needs those flags if
the date in df2 is >=
the date in df1
. The end result is df3
. The only observations get the flag whose states match and dates are >=
the original dates.
import pandas as pd
dict1 = {'date':['2020-01-01', '2020-02-15', '2020-02-04','2020-03-17',
'2020-06-15'],
'state':['AL','FL','MD','NC','SC'],
'flag': [1,1,1,1,1]}
df1 = pd.DataFrame(dict1)
df1['date'] = pd.to_datetime(df1['date'])
dict2 = {'state': ['AL','FL','MD','NC','SC'],
'keyid': ['001','002','003','004','005'],
'start_date':['2020-02-01', '2020-01-15', '2020-01-30','2020-05-18',
'2020-05-16']}
df2 = pd.DataFrame(dict2)
df2['start_date'] = pd.to_datetime(df2['start_date'])
df3 = df2
df3['flag'] = [0,1,1,0,1]
How do I get to df3 programmatically? My actual df1
has a row for each state. My actual df2
has over a million observations with different dates.
Upvotes: 1
Views: 139
Reputation: 862441
Use merge_asof
for merge by greater or equal datetimes by parameter direction='forward'
:
A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.
df2['need'] = [0,1,1,0,1]
df1 = df1.sort_values('date')
df2 = df2.sort_values('start_date')
df = pd.merge_asof(df2,
df1,
left_on='start_date',
right_on='date',
by='state',
direction='forward')
df['flag'] = df['flag'].fillna(0).astype(int)
print (df)
state keyid start_date need date flag
0 FL 002 2020-01-15 1 2020-02-15 1
1 MD 003 2020-01-30 1 2020-02-04 1
2 AL 001 2020-02-01 0 NaT 0
3 SC 005 2020-05-16 1 2020-06-15 1
4 NC 004 2020-05-18 0 NaT 0
You can also rename
column for avoid appending in output DataFrame
:
df2['need'] = [0,1,1,0,1]
df1 = df1.sort_values('date')
df2 = df2.sort_values('start_date')
df = pd.merge_asof(df2,
df1.rename(columns={'date':'start_date'}),
on='start_date',
by='state',
direction='forward')
df['flag'] = df['flag'].fillna(0).astype(int)
print (df)
state keyid start_date need flag
0 FL 002 2020-01-15 1 1
1 MD 003 2020-01-30 1 1
2 AL 001 2020-02-01 0 0
3 SC 005 2020-05-16 1 1
4 NC 004 2020-05-18 0 0
Upvotes: 1
Reputation: 34046
Use df.merge
and numpy.where
:
In [29]: import numpy as np
In [30]: df3 = df2.merge(df1)[['state', 'keyid', 'start_date', 'date']]
In [31]: df3['flag'] = np.where(df3['start_date'].ge(df3['date']), 0, 1)
In [33]: df3.drop('date', 1, inplace=True)
In [34]: df3
Out[34]:
state keyid start_date flag
0 AL 001 2020-02-01 0
1 FL 002 2020-01-15 1
2 MD 003 2020-01-30 1
3 NC 004 2020-05-18 0
4 SC 005 2020-05-16 1
Upvotes: 1