Brenton
Brenton

Reputation: 435

Adding column to dataframe based on implementing rule of another column

I'm assisting on a neonatal project which basically is looking at how to give medicine dosages based on a certain score. I've essentially been given a large dataframe filled with lots of different babies and scores at different dates and time points. An example of this is below:

import numpy as np
import pandas as pd
df = pd.DataFrame({
   'baby': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B','B', 'B', 'B', 'B', 'B','B','B'],
   'dateandtime':  ['8/2/2009  5:00:00 PM', '7/19/2009  5:00:00 PM', '7/19/2009  1:00:00 PM', '7/17/2009  6:00:00 AM','7/17/2009  12:01:00 AM', '7/14/2009  12:01:00 AM', '7/19/2009  5:00:00 AM', '7/16/2009  9:00:00 PM','7/19/2009  9:00:00 AM', '7/14/2009  6:00:00 PM', '7/15/2009  3:04:00 PM', '7/20/2009  5:00:00 PM','7/16/2009  12:01:00 AM', '7/18/2009  1:00:00 PM', '7/16/2009  6:00:00 AM', '7/13/2009  9:00:00 PM','7/19/2009  1:00:00 AM','7/15/2009  12:04:00 AM'],
   'score':  [6, 3, 7, 5, 10, 14, 5, 4, 11, 4, 4, 6, 7, 4, 6, 12, 6, 6]
})
print(df)

The first thing I did was to put the babies and timestamps in chronological order (this is important) using:

df = df.groupby(['baby', 'dateandtime'])

So now what I'm trying to figure out how to do is create a column to add onto my dataframe so that at each time point, I have either a -1 (decrease dosage), 0 (keep dosage the same), or +1 (increase dosage) depending upon the "score" based on a rule developed by the team. The rule I was given is as follows: It is +1 if either the sum of 3 consecutive scores ≥ 24 or most recent one or two scores is ≥ 12. It is -1 if the average score in the past 24 hours is < 8 after the scores stays below 8 for at least 48 hours. It is 0 otherwise.

So my issue is: I don't know really how to code this rule, especially since I'm not sure how to tell python "look at the last 48 hours". I think if I knew how to create a function to do this rule, I can screw around with the command df.apply(rule).reset_index() and then merge it with my original groupby df to get the final result. Any ideas for how to convert this rule into code? I'm at a loss.

Upvotes: 0

Views: 333

Answers (3)

KM_83
KM_83

Reputation: 727

For the second part of the rule that

-1 if the average score in the past 24 hours is < 8 after the scores stays below 8 >for at least 48 hours.

I was wondering if it means:

  1. the average score in the past 24 hours is < 8
  2. the maximum score of the 48 hours prior to that 24 hour-period is < 8 (that is, the maximum during the past hour 24-74 is < 8)

It's possible that the 48-hour period with scores less than 8 is the first hurdle, and then the subsequent 24-hour period with its average score less than 8 serves as the second hurdle for reducing dosage.

Importantly, it is ambiguous whether the first hurdle is cleared if such a 48-hour period data is not available.

Below is the steps to generate the maximum score between the past 24th hour and the past 72nd hour for each observation. (my apologies for a bit dirty code here)

df.dateandtime = pd.to_datetime(df['dateandtime']) # change column type for ease of indexing
df = df.sort_values(by=['baby','dateandtime'])
df= df.reset_index(drop=True)


def filter_df_for_datetime_between_timedeltas(df, varname, base_datetime, time_gr= -72, time_le= -24, unit='h'):
    df['hour_diff'] = df[varname] - base_datetime
    return df[ (pd.Timedelta(time_gr, unit=unit) < df.hour_diff) & 
                (df.hour_diff < pd.Timedelta(time_le, unit=unit))]
    
    
df_24h_to_72h_max = pd.DataFrame() 

for baby in list(df.baby.unique()):
    df_baby = df[df.baby==baby]
    for row in df_baby.iterrows():
        base_datetime = row[1]['dateandtime']
        df_row_24h_to_72h = filter_df_for_datetime_between_timedeltas(df_baby, 'dateandtime', base_datetime)
        # print( df_row_24h_to_72h)
        if len(df_row_24h_to_72h):
            data_row = df_row_24h_to_72h[df_row_24h_to_72h.score==df_row_24h_to_72h.score.max()][::-1].iloc[0]
            # print(data_row)
            row_stat = dict(**row[1][:2],
                            max_24h_to_72h = data_row.score.max(),
                            max_24h_to_72h_time = data_row.dateandtime,
                            hour_diff = -data_row.hour_diff )
            df_24h_to_72h_max = df_24h_to_72h_max.append(pd.DataFrame([row_stat]), ignore_index=True)
        
        
df_24h_to_72h_max

and merge this data with the original dataset;

df2 = df.set_index(['baby', 'dateandtime']).join(df_24h_to_72h_max.set_index(['baby', 'dateandtime']))

df2 = df2.reset_index().set_index('dateandtime')
    
#Calculate conditions
df2['sum_3_scores'] = df2.groupby('baby')['score'].rolling(3).sum().reset_index(0,drop=True)
df2['max_2_scores'] = df2.groupby('baby')['score'].rolling(2).max().reset_index(0,drop=True)
df2['mean_24hr_score'] = df2.groupby('baby')['score'].rolling('24h').mean().reset_index(0,drop=True)

#scoring logic
def score(data):
    if data['sum_3_scores'] >= 24 or data['max_2_scores'] >= 12:
        return 1
    if data['mean_24hr_score'] < 8 and data['max_24h_to_72h'] < 8: 
        return -1
    return 0

df2['rule'] = df2.apply(score, axis = 1)
df2

Upvotes: 2

B. Bogart
B. Bogart

Reputation: 1075

Please double check the math and the logic, but I think this is the path. You definitely need to group by baby or there will be some overlap in the numbers from one baby to the next.

df.dateandtime = pd.to_datetime(df['dateandtime']) # change column type for ease of indexing
df = df.set_index('dateandtime')
df.sort_index(inplace = True)

#Calculate conditions
df['sum_3_scores'] = df.groupby('baby')['score'].rolling(3).sum().reset_index(0,drop=True)
df['max_2_scores'] = df.groupby('baby')['score'].rolling(2).max().reset_index(0,drop=True)
df['max_48hr_score'] = df.groupby('baby')['score'].rolling('48h').mean().reset_index(0,drop=True)

#you don't nead to calculate the 24hr mean because the 48hr max is 8 the 24hr mean will also be < 8 
#df['mean_24hr_score'] = df.groupby('baby')['score'].rolling('24h').mean().reset_index(0,drop=True)

#scoring logic
def score(data):
    if data['sum_3_scores'] >= 24 or data['max_2_scores'] >= 12:
        return 1
    if data['max_48hr_score'] < 8: #if the score has been below 8 for 48 hours the average of the past 24 hours will be < 8
        return -1
    return 0

df['rule'] = df.apply(score, axis = 1)

#just for a nicely ordered output
df.reset_index().set_index(['baby','dateandtime']).sort_index()

yields:

                          score  sum_3_scores  max_2_scores  max_48hr_score  rule
baby dateandtime                                                                 
A    2009-07-14 00:01:00     14           NaN           NaN       14.000000     0
     2009-07-16 21:00:00      4           NaN          14.0        4.000000     1
     2009-07-17 00:01:00     10          28.0          10.0        7.000000     1
     2009-07-17 06:00:00      5          19.0          10.0        6.333333    -1
     2009-07-19 05:00:00      5          20.0           5.0        5.000000    -1
     2009-07-19 09:00:00     11          21.0          11.0        8.000000     0
     2009-07-19 13:00:00      7          23.0          11.0        7.666667    -1
     2009-07-19 17:00:00      3          21.0           7.0        6.500000    -1
     2009-08-02 17:00:00      6          16.0           6.0        6.000000    -1
B    2009-07-13 21:00:00     12           NaN           NaN       12.000000     0
     2009-07-14 18:00:00      4           NaN          12.0        8.000000     1
     2009-07-15 00:04:00      6          22.0           6.0        7.333333    -1
     2009-07-15 15:04:00      4          14.0           6.0        6.500000    -1
     2009-07-16 00:01:00      7          17.0           7.0        5.250000    -1
     2009-07-16 06:00:00      6          17.0           7.0        5.400000    -1
     2009-07-18 13:00:00      4          17.0           6.0        4.000000    -1
     2009-07-19 01:00:00      6          16.0           6.0        5.000000    -1
     2009-07-20 17:00:00      6          16.0           6.0        6.000000    -1

Upvotes: 1

Xavi Pi
Xavi Pi

Reputation: 91

How about trying something like:

import numpy as np
import pandas as pd
df = pd.DataFrame({
   'baby': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B','B', 'B', 'B', 'B', 'B','B','B'],
   'dateandtime':  ['8/2/2009  5:00:00 PM', '7/19/2009  5:00:00 PM', '7/19/2009  1:00:00 PM', '7/17/2009  6:00:00 AM','7/17/2009  12:01:00 AM', '7/14/2009  12:01:00 AM', '7/19/2009  5:00:00 AM', '7/16/2009  9:00:00 PM','7/19/2009  9:00:00 AM', '7/14/2009  6:00:00 PM', '7/15/2009  3:04:00 PM', '7/20/2009  5:00:00 PM','7/16/2009  12:01:00 AM', '7/18/2009  1:00:00 PM', '7/16/2009  6:00:00 AM', '7/13/2009  9:00:00 PM','7/19/2009  1:00:00 AM','7/15/2009  12:04:00 AM'],
   'score':  [6, 3, 7, 5, 10, 14, 5, 4, 11, 4, 4, 6, 7, 4, 6, 12, 6, 6]
})

df.dateandtime = pd.to_datetime(df['dateandtime']) # change column type for ease of indexing
df.sort_values(by=['baby', 'dateandtime'], inplace = True)

df['2_score_sum'] = df['score'].rolling(2, min_periods=2).sum()
df['3_score_sum'] = df['score'].rolling(3, min_periods=3).sum()
df['3_score_mean'] = df['score'].rolling(3, min_periods=3).mean()


df

it's the rolling that does the trick. More info here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html

from here you can just make a simple function that checks the appropriate column and returns a value for each row.

Upvotes: 1

Related Questions