The Great
The Great

Reputation: 7733

Elegant way to assign value to a column based on day interval of a group

I have a dataframe like as given below

df1 = pd.DataFrame({'subject_id' :[1,1,1,1,1,1,1,2,2,2,2],'day':[3,7,9,10,11,19,20,7,13,18,22] , 'fake_flag' :['fake VAC','','fake VAC','fake VAC','fake VAC','fake VAC','fake VAC','fake VAC','fake VAC','fake VAC','fake VAC']})

It looks like as shown below

enter image description here

I would like to fill values in actual_flag column based on below rules

a) fake_flag should have a value as fake_vac and it shouldn't be empty

b) fill values only for 1st day when fake_vac appears and for records after 14 days interval.

This is what I tried

t = df1[df1['fake_flag'] == 'fake VAC']
sub_list = t['subject_id'].unique().tolist()
   for sub in sub_list:
     day_list = t['day'][t['subject_id']==sub].tolist()
     min_value = min(day_list)
     index = t[t['day']==min_value].index
     df1.loc[index, 'actual_flag'] = 'act_vac'
     i_14day = min_value + 14
     day_values = [i for i in day_list if i >= i_14day]
     print("day greater than 14 are ", day_values)
     if len(day_values) > 0:
         for val in day_values:
            index = t[t['day']==val].index
            df1.loc[index, 'actual_flag'] = 'act_vac'

As you can see this is very lengthy and I can't do this for dataset of million records. Any efficient and elegant approach is helpful

I expect my output to be like as shown below

enter image description here

In this case for subject_id = 1, day 3 is the 1st time when fake vac appears and day 19 (19 is gt > 14 days interval from 3) and day 20 (20 is gt > 14 days interval from 3) are after 14 days interval. Any elegant and efficient solution is helpful

sample data for testing

df1 = pd.DataFrame({'subject_id' :[1,1,1,1,1,1,1,1,2,2,2,2],'day':[2,3,7,9,10,11,19,20,7,13,18,22] , 'fake_flag' :['','fake VAC','','fake VAC','fake VAC','fake VAC','fake VAC','fake VAC','fake VAC','fake VAC','fake VAC','fake VAC']})

** updated screenshot**

enter image description here

Upvotes: 1

Views: 96

Answers (1)

yatu
yatu

Reputation: 88276

One way would be to subtract the first day from all dates within each group, check which are greater than 14 and set these to "act_vac", as well as the initial days:

import numpy as np
# Returns a boolean with True if a given day - first day > 14
ix = df1.fake_flag.ne('').groupby(df1.subject_id).transform('idxmax')
c1 = df1.day.sub(df1.values[ix, 1]).gt(14)
# True if the id is different to previous row
c2 = df1.subject_id.ne(df1.subject_id.shift())
# logical OR of the above conditions
df1['actual_flag'] = np.where(c1 | c2, 'act_vac', '')

     subject_id  day fake_flag actual_flag
0            1    3  fake VAC     act_vac
1            1    7                      
2            1    9  fake VAC            
3            1   10  fake VAC            
4            1   11  fake VAC            
5            1   19  fake VAC     act_vac
6            1   20  fake VAC     act_vac
7            2    7  fake VAC     act_vac
8            2   13  fake VAC            
9            2   18  fake VAC            
10           2   22  fake VAC     act_vac

Details

df1.assign(c1=c1, c2=c2, actual_flag= np.where(c1 | c2, 'act_vac', ''))

     subject_id  day fake_flag actual_flag     c1     c2
0            1    3  fake VAC     act_vac  False   True
1            1    7                        False  False
2            1    9  fake VAC              False  False
3            1   10  fake VAC              False  False
4            1   11  fake VAC              False  False
5            1   19  fake VAC     act_vac   True  False
6            1   20  fake VAC     act_vac   True  False
7            2    7  fake VAC     act_vac  False   True
8            2   13  fake VAC              False  False
9            2   18  fake VAC              False  False
10           2   22  fake VAC     act_vac   True  False

Upvotes: 1

Related Questions