Reputation: 21
I mostly do simple merging with two files.. but this time, it is another level of merging which requires multiple condition, So I would like to seek your help or any advice will be appreciated! plase.
I have two files, one is df1 with two columns of State_code and date. df2 is about social distancing policies across states in US. Basically I wanted to do pd.merge(df1, df2, how = 'left', on=['state_code', 'date'])
,but want to make newly merged columns consist of three categorical variables 0,1,2 by conditions. {0: dates when df1 doesn't hold the policy, 1: dates under the policy, 2: dates under the eased policy}
So the files is look like as below. df1 is consisted of date starting from [February, 1st 2020 to June 30, 2020] which assigned to every single states. So each states has 150 rows(150 days). Therefore df1 consists of 7500 rows (150days * 50states). The below is sample of df1
df1 = pd.DataFrame({
'state_code':[1, 1, 1, 1, 4, 4, 2 ,2],
'date':[20200201, 20200328, 20200520, 20200404, 20200415,20200417, 20200425,20200528]})
The below is df2
df2 = pd.DataFrame({
'state_code':[1, 1, 1, 4, 4],
'StatePolicy':['SchoolClose','GatheringRestrict','StayAtHome','BarRestrict','GatheringRestrict'],
'DateIssued':[20200314,20200327,20200403,20200330,20200330],
'ExpDateExpiry':[20200831,20200522,20200430,np.nan,20200516],
'DateEased':[20200601,20200511,np.nan,20200516,np.nan],
'DateEnded':[20200626,np.nan,20200430,np.nan,20200516],
'DateReexpanded':[np.nan,np.nan,np.nan,20200629,np.nan],
'DateReeased':[np.nan,np.nan,np.nan,20200827,np.nan]})
df2 column explain:
1.state_code
2.StatePolicy : there is ten different social distancing policies in the file
3.DateIssued : First day of policy enactment
4.ExpDateExpiry : Expected Last day of policy effect (It is just expected expiration date, does not guarantee end)
5.DateEased : When the restriction of policy eased (This is not ended, just eased)
6.DateEnded : End date of policy. (It is end)
7.DateReexpanded : Date when they resumed the policy which is either eased or ended.
8.DateReeased : When the restriction of policy re-eased
The desired output is as below:
df3 = pd.DataFrame({
'state_code':[1, 1, 1, 1, 4, 4, 2,2],
'date':[20200201, 20200328, 20200520, 20200404, 20200415,20200417, 20200425,20200528],
'SchoolClose':[0,1,1,1,0,0,0,0],
'GatheringRestrict':[0,1,2,1,1,0,0,0],
'StayAtHome':[0,0,0,1,0,0,0,0],
'BarRestrict':[0,0,0,0,1,1,0,0],
'BusinessClose':[0,0,0,0,0,0,2,0]})
Mechnism:
It goes 0 when..
1.If date is before DateIssued.
2-1. If DateReexpanded is Null & DateEnded not Null —> 0 after DateEnded
2-1. If DateReexpanded is Null & DateEnded Null —> 0 after ExpDateExpiry
3-1. If DateReexpanded not Null & if DateReexpanded < DateEnded —> 0 after DateEnded
3-2 If DateReexpanded not Null & if DateReexpanded > DateEnded —> 1 after DateReexpanded
It goes 1 when..
1.If ExpDateExpiry&DateEnded&DateReexpanded&DateEased is Null ==> 1 after DateIssued
2-1.If DateEased Null& DateEnded Not Null & DateReexpanded is Null ==>1 between DateIssued and DateEnded
2-2.If DateEased Null&DateEnded Not Null & DateReexpanded Not Null ==> 1 between DateIssued and DateEnded + 1 after DateReexpanded
2-3.If DateEased Null&DateEnded is Null *ExpDateExpiry Not Null ==> 1 between DateIssued and ExpDateExpiry
3-1 If DateEased Not Null & DateReexpanded Null ==> 1 between DateIssued and DateEased
3-2 if DateEased Not Null & DateReexpanded Not Null ==> 1 between DateIssued and DateEased + 1 after DateReexpanded
It goes 2 only when.. DateEased is not Null. so all condition must include DateEased is not Null
1-1. If DateEased is not Null& DateReexpanded is Null & DateEnded not Null & DateEnded != DateEased ==> 2 between DateEased and DateEnded
1-2. If DateEased is not Null & DateReexpanded is Null & DateEnded not Null & DateEnded == DateEased ==> 0 after DateEnded
1-3. If DateEased is not Null & DateReexpanded is Null & DateEnded Null ==> 2 between DateEased and ExpDateExpiry
2.If DateEased is not Null &ExpDateExpiry & DateEnded & DateReexpanded is Null==> 2 after DateEased
I really need help.. and I try my best to explain.. I would appreciate so much for any suggestion or example how to merge two data based on one of the conditions above... Thank you.
Upvotes: 0
Views: 74
Reputation: 16683
Use .merge
and rely heavily on np.select()
. As mentioned in the comments, I did not do all the conditions, but I hope this gives you a guideline:
df3 = pd.merge(df1,df2,how='left',on='state_code')
school_close = df3['StatePolicy'] == 'SchoolClose'
school_close_rule1_c1 = df3['date'] < df3['DateIssued']
school_close_rule1_c2 = (df3['DateReexpanded'].isnull() & df3['DateEnded'].isnull())
school_close_rule1_c3 = (df3['DateReexpanded'].notnull() & df3['DateReexpanded'] < df3['DateEnded'])
school_close_rule1_conditions = school_close & (school_close_rule1_c1 | school_close_rule1_c2 | school_close_rule1_c3)
school_close_rule2_c1 = df3.iloc[:,4:8].isnull().sum(axis=1) == 4 #1.If ExpDateExpiry&DateEnded&DateReexpanded&DateEased is Null
school_close_rule2_c2 = (df3['DateEased'].isnull() & df3['DateEnded'].notnull() & df3['DateReexpanded'].isnull())
school_close_rule2_c3 = (df3['DateEased'].isnull() & df3['DateEnded'].notnull() & df3['DateReexpanded'].notnull())
school_close_rule2_c4 = (df3['DateEased'].isnull() & df3['DateEnded'].isnull() & df3['ExpDateExpiry'].notnull())
school_close_rule2_c5 = (df3['DateEased'].notnull())
school_close_rule2_conditions = school_close & (school_close_rule2_c1 | school_close_rule2_c2 |
school_close_rule2_c3 | school_close_rule2_c4 |
school_close_rule2_c5)
school_close_conditions = [school_close_rule1_conditions, school_close_rule2_conditions]
school_close_results = [0,1]
df3['SchoolClose'] = np.select(school_close_conditions, school_close_results, np.nan) #np.nan should be passed as the alternative result
df3
Out[1]:
state_code date StatePolicy DateIssued ExpDateExpiry \
0 1 20200201 SchoolClose 20200314.0 20200831.0
1 1 20200201 GatheringRestrict 20200327.0 20200522.0
2 1 20200201 StayAtHome 20200403.0 20200430.0
3 1 20200328 SchoolClose 20200314.0 20200831.0
4 1 20200328 GatheringRestrict 20200327.0 20200522.0
5 1 20200328 StayAtHome 20200403.0 20200430.0
6 1 20200520 SchoolClose 20200314.0 20200831.0
7 1 20200520 GatheringRestrict 20200327.0 20200522.0
8 1 20200520 StayAtHome 20200403.0 20200430.0
9 1 20200404 SchoolClose 20200314.0 20200831.0
10 1 20200404 GatheringRestrict 20200327.0 20200522.0
11 1 20200404 StayAtHome 20200403.0 20200430.0
12 4 20200415 BarRestrict 20200330.0 NaN
13 4 20200415 GatheringRestrict 20200330.0 20200516.0
14 4 20200417 BarRestrict 20200330.0 NaN
15 4 20200417 GatheringRestrict 20200330.0 20200516.0
16 2 20200425 NaN NaN NaN
17 2 20200528 NaN NaN NaN
DateEased DateEnded DateReexpanded DateReeased SchoolClose
0 20200601.0 20200626.0 NaN NaN 0.0
1 20200511.0 NaN NaN NaN NaN
2 NaN 20200430.0 NaN NaN NaN
3 20200601.0 20200626.0 NaN NaN 0.0
4 20200511.0 NaN NaN NaN NaN
5 NaN 20200430.0 NaN NaN NaN
6 20200601.0 20200626.0 NaN NaN 0.0
7 20200511.0 NaN NaN NaN NaN
8 NaN 20200430.0 NaN NaN NaN
9 20200601.0 20200626.0 NaN NaN 0.0
10 20200511.0 NaN NaN NaN NaN
11 NaN 20200430.0 NaN NaN NaN
12 20200516.0 NaN 20200629.0 20200827.0 NaN
13 NaN 20200516.0 NaN NaN NaN
14 20200516.0 NaN 20200629.0 20200827.0 NaN
15 NaN 20200516.0 NaN NaN NaN
16 NaN NaN NaN NaN NaN
17 NaN NaN NaN NaN NaN
Now, just grab the SchoolClose
column along with 'state_code'
and 'date'
and merge it back into the original dataframe. Repeat the same process for all of the other columns. I am sure that this code can really be refined, but this shows you the methodology behind the approach:
df_SchoolClose = df3[['state_code', 'date', 'SchoolClose']].dropna()
df4 = pd.merge(df1, df_SchoolClose, how='left', on=['state_code', 'date'])
df4
Out[1]:
state_code date SchoolClose
0 1 20200201 0.0
1 1 20200328 0.0
2 1 20200520 0.0
3 1 20200404 0.0
4 4 20200415 NaN
5 4 20200417 NaN
6 2 20200425 NaN
7 2 20200528 NaN
Upvotes: 1