Rachelle
Rachelle

Reputation: 21

Merge with multiple condition..(really need help)

I mostly do simple merging with two files.. but this time, it is another level of merging which requires multiple condition, So I would like to seek your help or any advice will be appreciated! plase.

I have two files, one is df1 with two columns of State_code and date. df2 is about social distancing policies across states in US. Basically I wanted to do pd.merge(df1, df2, how = 'left', on=['state_code', 'date']),but want to make newly merged columns consist of three categorical variables 0,1,2 by conditions. {0: dates when df1 doesn't hold the policy, 1: dates under the policy, 2: dates under the eased policy}

So the files is look like as below. df1 is consisted of date starting from [February, 1st 2020 to June 30, 2020] which assigned to every single states. So each states has 150 rows(150 days). Therefore df1 consists of 7500 rows (150days * 50states). The below is sample of df1

df1 = pd.DataFrame({
'state_code':[1, 1, 1, 1, 4, 4, 2 ,2],
'date':[20200201, 20200328, 20200520, 20200404, 20200415,20200417, 20200425,20200528]})

The below is df2

df2 = pd.DataFrame({
'state_code':[1, 1, 1, 4, 4],
'StatePolicy':['SchoolClose','GatheringRestrict','StayAtHome','BarRestrict','GatheringRestrict'],
'DateIssued':[20200314,20200327,20200403,20200330,20200330],
'ExpDateExpiry':[20200831,20200522,20200430,np.nan,20200516],
'DateEased':[20200601,20200511,np.nan,20200516,np.nan],
'DateEnded':[20200626,np.nan,20200430,np.nan,20200516],
'DateReexpanded':[np.nan,np.nan,np.nan,20200629,np.nan],
'DateReeased':[np.nan,np.nan,np.nan,20200827,np.nan]})

df2 column explain:

1.state_code

2.StatePolicy : there is ten different social distancing policies in the file

3.DateIssued : First day of policy enactment

4.ExpDateExpiry : Expected Last day of policy effect (It is just expected expiration date, does not guarantee end)

5.DateEased : When the restriction of policy eased (This is not ended, just eased)

6.DateEnded : End date of policy. (It is end)

7.DateReexpanded : Date when they resumed the policy which is either eased or ended.

8.DateReeased : When the restriction of policy re-eased

The desired output is as below:

df3 = pd.DataFrame({
'state_code':[1, 1, 1, 1, 4, 4, 2,2],
'date':[20200201, 20200328, 20200520, 20200404, 20200415,20200417, 20200425,20200528],
'SchoolClose':[0,1,1,1,0,0,0,0],
'GatheringRestrict':[0,1,2,1,1,0,0,0],
'StayAtHome':[0,0,0,1,0,0,0,0],
'BarRestrict':[0,0,0,0,1,1,0,0],
'BusinessClose':[0,0,0,0,0,0,2,0]})

Mechnism:

It goes 0 when..

1.If date is before DateIssued.

2-1. If DateReexpanded is Null & DateEnded not Null —> 0 after DateEnded

2-1. If DateReexpanded is Null & DateEnded Null —> 0 after ExpDateExpiry

3-1. If DateReexpanded not Null & if DateReexpanded < DateEnded —> 0 after DateEnded

3-2 If DateReexpanded not Null & if DateReexpanded > DateEnded —> 1 after DateReexpanded

It goes 1 when..

1.If ExpDateExpiry&DateEnded&DateReexpanded&DateEased is Null ==> 1 after DateIssued

2-1.If DateEased Null& DateEnded Not Null & DateReexpanded is Null ==>1 between DateIssued and DateEnded

2-2.If DateEased Null&DateEnded Not Null & DateReexpanded Not Null ==> 1 between DateIssued and DateEnded + 1 after DateReexpanded

2-3.If DateEased Null&DateEnded is Null *ExpDateExpiry Not Null ==> 1 between DateIssued and ExpDateExpiry

3-1 If DateEased Not Null & DateReexpanded Null ==> 1 between DateIssued and DateEased

3-2 if DateEased Not Null & DateReexpanded Not Null ==> 1 between DateIssued and DateEased + 1 after DateReexpanded

It goes 2 only when.. DateEased is not Null. so all condition must include DateEased is not Null

1-1. If DateEased is not Null& DateReexpanded is Null & DateEnded not Null & DateEnded != DateEased ==> 2 between DateEased and DateEnded

1-2. If DateEased is not Null & DateReexpanded is Null & DateEnded not Null & DateEnded == DateEased ==> 0 after DateEnded

1-3. If DateEased is not Null & DateReexpanded is Null & DateEnded Null ==> 2 between DateEased and ExpDateExpiry

2.If DateEased is not Null &ExpDateExpiry & DateEnded & DateReexpanded is Null==> 2 after DateEased

I really need help.. and I try my best to explain.. I would appreciate so much for any suggestion or example how to merge two data based on one of the conditions above... Thank you.

Upvotes: 0

Views: 74

Answers (1)

David Erickson
David Erickson

Reputation: 16683

Use .merge and rely heavily on np.select(). As mentioned in the comments, I did not do all the conditions, but I hope this gives you a guideline:

df3 = pd.merge(df1,df2,how='left',on='state_code')

school_close = df3['StatePolicy'] == 'SchoolClose'

school_close_rule1_c1 = df3['date'] < df3['DateIssued']
school_close_rule1_c2 = (df3['DateReexpanded'].isnull() & df3['DateEnded'].isnull())
school_close_rule1_c3 = (df3['DateReexpanded'].notnull() & df3['DateReexpanded'] < df3['DateEnded'])
school_close_rule1_conditions = school_close & (school_close_rule1_c1 | school_close_rule1_c2 | school_close_rule1_c3)

school_close_rule2_c1 = df3.iloc[:,4:8].isnull().sum(axis=1) == 4 #1.If ExpDateExpiry&DateEnded&DateReexpanded&DateEased is Null 
school_close_rule2_c2 = (df3['DateEased'].isnull() & df3['DateEnded'].notnull() & df3['DateReexpanded'].isnull())
school_close_rule2_c3 = (df3['DateEased'].isnull() & df3['DateEnded'].notnull() & df3['DateReexpanded'].notnull())
school_close_rule2_c4 = (df3['DateEased'].isnull() & df3['DateEnded'].isnull() & df3['ExpDateExpiry'].notnull())
school_close_rule2_c5 = (df3['DateEased'].notnull())
school_close_rule2_conditions = school_close & (school_close_rule2_c1 | school_close_rule2_c2 |
                                                school_close_rule2_c3 | school_close_rule2_c4 |
                                                school_close_rule2_c5)

school_close_conditions = [school_close_rule1_conditions, school_close_rule2_conditions]
school_close_results = [0,1]

df3['SchoolClose'] = np.select(school_close_conditions, school_close_results, np.nan) #np.nan should be passed as the alternative result
df3

Out[1]: 
    state_code      date        StatePolicy  DateIssued  ExpDateExpiry  \
0            1  20200201        SchoolClose  20200314.0     20200831.0   
1            1  20200201  GatheringRestrict  20200327.0     20200522.0   
2            1  20200201         StayAtHome  20200403.0     20200430.0   
3            1  20200328        SchoolClose  20200314.0     20200831.0   
4            1  20200328  GatheringRestrict  20200327.0     20200522.0   
5            1  20200328         StayAtHome  20200403.0     20200430.0   
6            1  20200520        SchoolClose  20200314.0     20200831.0   
7            1  20200520  GatheringRestrict  20200327.0     20200522.0   
8            1  20200520         StayAtHome  20200403.0     20200430.0   
9            1  20200404        SchoolClose  20200314.0     20200831.0   
10           1  20200404  GatheringRestrict  20200327.0     20200522.0   
11           1  20200404         StayAtHome  20200403.0     20200430.0   
12           4  20200415        BarRestrict  20200330.0            NaN   
13           4  20200415  GatheringRestrict  20200330.0     20200516.0   
14           4  20200417        BarRestrict  20200330.0            NaN   
15           4  20200417  GatheringRestrict  20200330.0     20200516.0   
16           2  20200425                NaN         NaN            NaN   
17           2  20200528                NaN         NaN            NaN   

     DateEased   DateEnded  DateReexpanded  DateReeased  SchoolClose  
0   20200601.0  20200626.0             NaN          NaN          0.0  
1   20200511.0         NaN             NaN          NaN          NaN  
2          NaN  20200430.0             NaN          NaN          NaN  
3   20200601.0  20200626.0             NaN          NaN          0.0  
4   20200511.0         NaN             NaN          NaN          NaN  
5          NaN  20200430.0             NaN          NaN          NaN  
6   20200601.0  20200626.0             NaN          NaN          0.0  
7   20200511.0         NaN             NaN          NaN          NaN  
8          NaN  20200430.0             NaN          NaN          NaN  
9   20200601.0  20200626.0             NaN          NaN          0.0  
10  20200511.0         NaN             NaN          NaN          NaN  
11         NaN  20200430.0             NaN          NaN          NaN  
12  20200516.0         NaN      20200629.0   20200827.0          NaN  
13         NaN  20200516.0             NaN          NaN          NaN  
14  20200516.0         NaN      20200629.0   20200827.0          NaN  
15         NaN  20200516.0             NaN          NaN          NaN  
16         NaN         NaN             NaN          NaN          NaN  
17         NaN         NaN             NaN          NaN          NaN  

Now, just grab the SchoolClose column along with 'state_code' and 'date' and merge it back into the original dataframe. Repeat the same process for all of the other columns. I am sure that this code can really be refined, but this shows you the methodology behind the approach:

df_SchoolClose = df3[['state_code', 'date', 'SchoolClose']].dropna()
df4 = pd.merge(df1, df_SchoolClose, how='left', on=['state_code', 'date'])
df4
Out[1]: 
   state_code      date  SchoolClose
0           1  20200201          0.0
1           1  20200328          0.0
2           1  20200520          0.0
3           1  20200404          0.0
4           4  20200415          NaN
5           4  20200417          NaN
6           2  20200425          NaN
7           2  20200528          NaN

Upvotes: 1

Related Questions