Reputation: 435
I'm assisting on a neonatal project which basically is looking at how to give medicine dosages based on a certain score. I've essentially been given a large dataframe filled with lots of different babies and scores at different dates and time points. An example of this is below:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'baby': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B','B', 'B', 'B', 'B', 'B','B','B'],
'dateandtime': ['8/2/2009 5:00:00 PM', '7/19/2009 5:00:00 PM', '7/19/2009 1:00:00 PM', '7/17/2009 6:00:00 AM','7/17/2009 12:01:00 AM', '7/14/2009 12:01:00 AM', '7/19/2009 5:00:00 AM', '7/16/2009 9:00:00 PM','7/19/2009 9:00:00 AM', '7/14/2009 6:00:00 PM', '7/15/2009 3:04:00 PM', '7/20/2009 5:00:00 PM','7/16/2009 12:01:00 AM', '7/18/2009 1:00:00 PM', '7/16/2009 6:00:00 AM', '7/13/2009 9:00:00 PM','7/19/2009 1:00:00 AM','7/15/2009 12:04:00 AM'],
'score': [6, 3, 7, 5, 10, 14, 5, 4, 11, 4, 4, 6, 7, 4, 6, 12, 6, 6]
})
print(df)
The first thing I did was to put the babies and timestamps in chronological order (this is important) using:
df = df.groupby(['baby', 'dateandtime'])
So now what I'm trying to figure out how to do is create a column to add onto my dataframe so that at each time point, I have either a -1 (decrease dosage), 0 (keep dosage the same), or +1 (increase dosage) depending upon the "score" based on a rule developed by the team. The rule I was given is as follows: It is +1 if either the sum of 3 consecutive scores ≥ 24 or most recent one or two scores is ≥ 12. It is -1 if the average score in the past 24 hours is < 8 after the scores stays below 8 for at least 48 hours. It is 0 otherwise.
So my issue is: I don't know really how to code this rule, especially since I'm not sure how to tell python "look at the last 48 hours". I think if I knew how to create a function to do this rule, I can screw around with the command df.apply(rule).reset_index(
) and then merge it with my original groupby
df to get the final result. Any ideas for how to convert this rule into code? I'm at a loss.
Upvotes: 0
Views: 333
Reputation: 727
For the second part of the rule that
-1 if the average score in the past 24 hours is < 8 after the scores stays below 8 >for at least 48 hours.
I was wondering if it means:
It's possible that the 48-hour period with scores less than 8 is the first hurdle, and then the subsequent 24-hour period with its average score less than 8 serves as the second hurdle for reducing dosage.
Importantly, it is ambiguous whether the first hurdle is cleared if such a 48-hour period data is not available.
Below is the steps to generate the maximum score between the past 24th hour and the past 72nd hour for each observation. (my apologies for a bit dirty code here)
df.dateandtime = pd.to_datetime(df['dateandtime']) # change column type for ease of indexing
df = df.sort_values(by=['baby','dateandtime'])
df= df.reset_index(drop=True)
def filter_df_for_datetime_between_timedeltas(df, varname, base_datetime, time_gr= -72, time_le= -24, unit='h'):
df['hour_diff'] = df[varname] - base_datetime
return df[ (pd.Timedelta(time_gr, unit=unit) < df.hour_diff) &
(df.hour_diff < pd.Timedelta(time_le, unit=unit))]
df_24h_to_72h_max = pd.DataFrame()
for baby in list(df.baby.unique()):
df_baby = df[df.baby==baby]
for row in df_baby.iterrows():
base_datetime = row[1]['dateandtime']
df_row_24h_to_72h = filter_df_for_datetime_between_timedeltas(df_baby, 'dateandtime', base_datetime)
# print( df_row_24h_to_72h)
if len(df_row_24h_to_72h):
data_row = df_row_24h_to_72h[df_row_24h_to_72h.score==df_row_24h_to_72h.score.max()][::-1].iloc[0]
# print(data_row)
row_stat = dict(**row[1][:2],
max_24h_to_72h = data_row.score.max(),
max_24h_to_72h_time = data_row.dateandtime,
hour_diff = -data_row.hour_diff )
df_24h_to_72h_max = df_24h_to_72h_max.append(pd.DataFrame([row_stat]), ignore_index=True)
df_24h_to_72h_max
and merge this data with the original dataset;
df2 = df.set_index(['baby', 'dateandtime']).join(df_24h_to_72h_max.set_index(['baby', 'dateandtime']))
df2 = df2.reset_index().set_index('dateandtime')
#Calculate conditions
df2['sum_3_scores'] = df2.groupby('baby')['score'].rolling(3).sum().reset_index(0,drop=True)
df2['max_2_scores'] = df2.groupby('baby')['score'].rolling(2).max().reset_index(0,drop=True)
df2['mean_24hr_score'] = df2.groupby('baby')['score'].rolling('24h').mean().reset_index(0,drop=True)
#scoring logic
def score(data):
if data['sum_3_scores'] >= 24 or data['max_2_scores'] >= 12:
return 1
if data['mean_24hr_score'] < 8 and data['max_24h_to_72h'] < 8:
return -1
return 0
df2['rule'] = df2.apply(score, axis = 1)
df2
Upvotes: 2
Reputation: 1075
Please double check the math and the logic, but I think this is the path. You definitely need to group by baby or there will be some overlap in the numbers from one baby to the next.
df.dateandtime = pd.to_datetime(df['dateandtime']) # change column type for ease of indexing
df = df.set_index('dateandtime')
df.sort_index(inplace = True)
#Calculate conditions
df['sum_3_scores'] = df.groupby('baby')['score'].rolling(3).sum().reset_index(0,drop=True)
df['max_2_scores'] = df.groupby('baby')['score'].rolling(2).max().reset_index(0,drop=True)
df['max_48hr_score'] = df.groupby('baby')['score'].rolling('48h').mean().reset_index(0,drop=True)
#you don't nead to calculate the 24hr mean because the 48hr max is 8 the 24hr mean will also be < 8
#df['mean_24hr_score'] = df.groupby('baby')['score'].rolling('24h').mean().reset_index(0,drop=True)
#scoring logic
def score(data):
if data['sum_3_scores'] >= 24 or data['max_2_scores'] >= 12:
return 1
if data['max_48hr_score'] < 8: #if the score has been below 8 for 48 hours the average of the past 24 hours will be < 8
return -1
return 0
df['rule'] = df.apply(score, axis = 1)
#just for a nicely ordered output
df.reset_index().set_index(['baby','dateandtime']).sort_index()
yields:
score sum_3_scores max_2_scores max_48hr_score rule
baby dateandtime
A 2009-07-14 00:01:00 14 NaN NaN 14.000000 0
2009-07-16 21:00:00 4 NaN 14.0 4.000000 1
2009-07-17 00:01:00 10 28.0 10.0 7.000000 1
2009-07-17 06:00:00 5 19.0 10.0 6.333333 -1
2009-07-19 05:00:00 5 20.0 5.0 5.000000 -1
2009-07-19 09:00:00 11 21.0 11.0 8.000000 0
2009-07-19 13:00:00 7 23.0 11.0 7.666667 -1
2009-07-19 17:00:00 3 21.0 7.0 6.500000 -1
2009-08-02 17:00:00 6 16.0 6.0 6.000000 -1
B 2009-07-13 21:00:00 12 NaN NaN 12.000000 0
2009-07-14 18:00:00 4 NaN 12.0 8.000000 1
2009-07-15 00:04:00 6 22.0 6.0 7.333333 -1
2009-07-15 15:04:00 4 14.0 6.0 6.500000 -1
2009-07-16 00:01:00 7 17.0 7.0 5.250000 -1
2009-07-16 06:00:00 6 17.0 7.0 5.400000 -1
2009-07-18 13:00:00 4 17.0 6.0 4.000000 -1
2009-07-19 01:00:00 6 16.0 6.0 5.000000 -1
2009-07-20 17:00:00 6 16.0 6.0 6.000000 -1
Upvotes: 1
Reputation: 91
How about trying something like:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'baby': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B','B', 'B', 'B', 'B', 'B','B','B'],
'dateandtime': ['8/2/2009 5:00:00 PM', '7/19/2009 5:00:00 PM', '7/19/2009 1:00:00 PM', '7/17/2009 6:00:00 AM','7/17/2009 12:01:00 AM', '7/14/2009 12:01:00 AM', '7/19/2009 5:00:00 AM', '7/16/2009 9:00:00 PM','7/19/2009 9:00:00 AM', '7/14/2009 6:00:00 PM', '7/15/2009 3:04:00 PM', '7/20/2009 5:00:00 PM','7/16/2009 12:01:00 AM', '7/18/2009 1:00:00 PM', '7/16/2009 6:00:00 AM', '7/13/2009 9:00:00 PM','7/19/2009 1:00:00 AM','7/15/2009 12:04:00 AM'],
'score': [6, 3, 7, 5, 10, 14, 5, 4, 11, 4, 4, 6, 7, 4, 6, 12, 6, 6]
})
df.dateandtime = pd.to_datetime(df['dateandtime']) # change column type for ease of indexing
df.sort_values(by=['baby', 'dateandtime'], inplace = True)
df['2_score_sum'] = df['score'].rolling(2, min_periods=2).sum()
df['3_score_sum'] = df['score'].rolling(3, min_periods=3).sum()
df['3_score_mean'] = df['score'].rolling(3, min_periods=3).mean()
df
it's the rolling that does the trick. More info here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html
from here you can just make a simple function that checks the appropriate column and returns a value for each row.
Upvotes: 1