Reputation: 471
I have a dataframe which looks like below and I am trying to calculate the daily dose for each drug. In the below table I have one drug but in the actual table I have multiple drugs.
Let me give a brief explanation of the table and the scenario. The ID column is the unique patient id, The start, and end column says when the medication has started and ended for that rate, the hours' column says # of hours between start and end for that ID, the rate column says at what rate the medication was given during that period of time for that ID. The way the table is designed is there is a start and end date for each rate for every ID.
In a hospital setting, if they want to calculate the total daily dose, they calculate from 7 AM today to 7 AM the next day rather than doing it from midnight to midnight.
ID START END DRUG RATE HOURS
15114299 2011-03-18 18:23:00 2011-03-19 20:59:59 morphine 0.03 26.6
15114299 2011-03-19 21:00:00 2011-03-20 01:29:59 morphine 0.1 4.4
15340823 2011-02-18 16:00:00 2011-02-19 03:09:59 morphine 0.1 11.16
15340823 2011-02-19 15:30:00 2011-02-19 21:59:59 morphine 0.05 6.49
15621352 2011-04-20 03:44:00 2011-04-20 13:41:59 morphine 0.05 9.96
15621352 2011-04-20 13:42:00 2011-04-20 23:59:59 morphine 0.5 10.29
15775791 2011-07-21 19:00:00 2011-07-24 03:59:59 morphine 0.1 56.99
15775791 2011-07-24 04:00:00 2011-07-24 14:14:59 morphine 0.05 10.24
15947559 2011-07-05 19:40:00 2011-07-06 05:43:59 morphine 0.1 10.06
15947559 2011-07-06 05:44:00 2011-07-09 01:59:59 morphine 0.15 68.26
15947559 2011-07-09 02:00:00 2011-07-09 18:59:59 morphine 0.1 16.99
15947559 2011-07-14 19:30:00 2011-07-15 18:29:59 morphine 0.1 22.99
15947559 2011-07-15 18:30:00 2011-07-17 02:59:59 morphine 0.15 32.49
15947559 2011-07-17 03:00:00 2011-07-17 08:59:59 morphine 0.1 5.99
15947559 2011-07-17 09:00:00 2011-07-17 16:59:59 morphine 0.075 7.99
As we can see in the example dataframe, the first medication was for 26.6 hours and what I want to do is split that record into two which looks like below and having a day column would be useful to group by and get the sum for each day in the end.
Expected DF:
ID START END DRUG RATE HOURS Days
15114299 2011-03-18 18:23:00 2011-03-19 06:59:59 morphine 0.03 12.62 Day1
15114299 2011-03-19 07:00:00 2011-03-19 20:59:59 morphine 0.03 14 Day2
15114299 2011-03-19 21:00:00 2011-03-20 01:29:59 morphine 0.1 4.5 Day2
15340823 2011-02-18 16:00:00 2011-02-19 03:09:59 morphine 0.1 11.16 Day1
15340823 2011-02-19 15:30:00 2011-02-19 21:59:59 morphine 0.05 6.49 Day2
15621352 2011-04-20 03:44:00 2011-04-20 06:59:59 morphine 0.05 3.27 Day1
15621352 2011-04-20 07:00:00 2011-04-20 13:41:59 morphine 0.05 6.70 Day2
15621352 2011-04-20 13:42:00 2011-04-20 23:59:59 morphine 0.5 10.29 Day2
15775791 2011-07-21 19:00:00 2011-07-22 06:59:59 morphine 0.1 12 Day1
15775791 2011-07-22 07:00:00 2011-07-23 06:59:59 morphine 0.1 24 Day2
15775791 2011-07-23 07:00:00 2011-07-24 03:59:59 morphine 0.1 21 Day3
15775791 2011-07-24 04:00:00 2011-07-24 06:59:59 morphine 0.05 3 Day3
15775791 2011-07-24 07:00:00 2011-07-24 14:14:59 morphine 0.05 7.25 Day4
15947559 2011-07-05 19:40:00 2011-07-06 05:43:59 morphine 0.1 10.06 Day1
15947559 2011-07-06 05:44:00 2011-07-06 06:59:59 morphine 0.15 1.27 Day1
15947559 2011-07-06 07:00:00 2011-07-07 06:59:59 morphine 0.15 24 Day2
15947559 2011-07-07 07:00:00 2011-07-08 06:59:59 morphine 0.15 24 Day3
15947559 2011-07-08 07:00:00 2011-07-09 01:59:59 morphine 0.15 19 Day4
15947559 2011-07-09 02:00:00 2011-07-09 06:59:59 morphine 0.1 5 Day4
15947559 2011-07-09 07:00:00 2011-07-09 18:59:59 morphine 0.1 12 Day5
15947559 2011-07-14 19:30:00 2011-07-15 06:59:59 morphine 0.1 11.50 Day6
15947559 2011-07-15 07:00:00 2011-07-15 18:29:59 morphine 0.1 11.50 Day7
15947559 2011-07-15 18:30:00 2011-07-16 06:59:59 morphine 0.15 12.50 Day7
15947559 2011-07-16 07:00:00 2011-07-17 02:59:59 morphine 0.15 20 Day8
15947559 2011-07-17 03:00:00 2011-07-17 06:59:59 morphine 0.1 4 Day8
15947559 2011-07-17 07:00:00 2011-07-17 08:59:59 morphine 0.1 2 Day9
15947559 2011-07-17 09:00:00 2011-07-17 16:59:59 morphine 0.075 8 Day9
(I created the spaces between two ID's just to be more clear and understanding)
I tried the following code.
test = pd.concat([pd.DataFrame({'START': pd.date_range(row['START'], row['END'], freq='D'),
'PERSON_ID': row['ID'],
'DRUG': row['DRUG'],
'RATE': row['RATE']}, columns=['ID', 'START', 'DRUG', 'RATE'])
for i, row in df.iterrows()], ignore_index=False)
but this creates a dataframe in the following way.
ID START DRUG RATE
15114299 2011-03-18 18:23:00 morphine 0.03
15114299 2011-03-19 18:23:00 morphine 0.03
15114299 2011-03-19 21:00:00 morphine 0.1
15340823 2011-02-18 16:00:00 morphine 0.1
15340823 2011-02-19 15:30:00 morphine 0.05
15621352 2011-04-20 03:44:00 morphine 0.05
15621352 2011-04-20 13:42:00 morphine 0.5
15775791 2011-07-21 19:00:00 morphine 0.1
15775791 2011-07-22 19:00:00 morphine 0.1
15775791 2011-07-23 19:00:00 morphine 0.1
15775791 2011-07-24 04:00:00 morphine 0.05
15947559 2011-07-05 19:40:00 morphine 0.1
15947559 2011-07-06 05:44:00 morphine 0.15
15947559 2011-07-07 05:44:00 morphine 0.15
15947559 2011-07-08 05:44:00 morphine 0.15
15947559 2011-07-09 02:00:00 morphine 0.1
15947559 2011-07-14 19:30:00 morphine 0.1
15947559 2011-07-15 18:30:00 morphine 0.15
15947559 2011-07-16 18:30:00 morphine 0.15
15947559 2011-07-17 03:00:00 morphine 0.1
15947559 2011-07-17 09:00:00 morphine 0.075
I'd really appreciate if someone can help me on how to approach this problem. Happy to provide additional details.
Upvotes: 2
Views: 419
Reputation: 30971
Start from defining such a variables:
h7 = pd.Timedelta('7H')
s1 = pd.Timedelta('1S')
(will be used several times).
To "expand" each row into a sequence of rows (for each day), define the following function:
def expand(row):
tm1, tm2 = row.START, row.END
t1 = tm1 if type(tm1).__name__ == 'Timestamp' else pd.to_datetime(tm1)
t2 = tm2 if type(tm2).__name__ == 'Timestamp' else pd.to_datetime(tm2)
rng = pd.date_range(start=t1 - h7, end=t2 - h7, freq='1D', closed='right',
normalize=True) + pd.Timedelta('7H')
ind = pd.DatetimeIndex([t1, *rng, t2 + s1])
res = pd.DataFrame({'ID': row.ID, 'START': ind[:-1], 'END': ind[1:] - s1,
'DRUG': row.DRUG, 'RATE': row.RATE})
res['HOURS'] = (res.END - res.START).dt.total_seconds().div(3600).round(2)
return res
and apply it, creating the result table:
df2 = pd.concat(df.apply(expand, axis=1).tolist(), ignore_index=True)
for now, without Days column.
Then add Days column, executing:
df2['Days'] = 'Day' + ((df2.START - h7).dt.date - df2.groupby('ID')
.START.transform('first').dt.date).dt.days.add(1).astype(str)
The Day number (after 'Day') is the difference between the current date and the date from the first row + 1.
Due to subtraction of h7 from the current date, this code works correctly, even if the change occurred between midnignt and 7 o'clock.
To present the result, run:
df2.sort_values('ID')
To trace the idea of this solution, run each statement separately and see the result.
Upvotes: 2