Reputation: 293
I have a pandas DataFrame df with two columns, ["patientVisit_id", "demandTime"], where patientVisit_id is an int64 column and demandTime is datetime.
I want to write a Python function that counts the number of patientVisit_id values each day, starting at hour X and covering the next Y hours. The output should have two columns: ds, the date (from the minimum date in the data through the maximum date), and total_patient, the number of patients from hour X on ds over the next Y hours.
If Y is greater than 24 the windows will overlap, but that is fine.
import pandas as pd
df = pd.DataFrame({
    # 9 ids to match the 9 timestamps below (columns must have equal length)
    "patientVisit_id": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "demandTime": pd.to_datetime([
        "2023-06-06 06:00:00", "2023-06-06 07:00:00", "2023-06-06 08:00:00",
        "2023-06-06 09:00:00", "2023-06-06 10:00:00", "2023-06-07 02:00:00",
        "2023-06-07 12:00:00", "2023-06-07 13:00:00", "2023-06-07 14:00:00"
    ])
})
def group_by_duration(df, x, y):
    return output_df
Given the example dataframe above, if x = 6 and y = 22, the output would be:
import datetime as dt
output_df = pd.DataFrame({
    "ds": [dt.date(2023, 6, 6), dt.date(2023, 6, 7)],
    "total_patient": [6, 3]
})
This is because there are 6 patients from 2023/6/6 6AM to 2023/6/7 4AM (the next 22 hours) and 3 patients from 2023/6/7 6AM to 2023/6/8 4AM.
I tried df.resample and pd.crosstab, but they did not seem to work. I feel it still has to be some sort of groupby, but I could not figure it out. Many thanks in advance.
Upvotes: 0
Views: 40
Reputation: 863731
Because the windows can overlap, process each day separately and count the patients by summing the True values of a boolean mask:
def group_by_duration(df, x, y):
    out = []
    # one iteration per calendar day that appears in the data
    for d in df['demandTime'].dt.normalize().drop_duplicates():
        start = d + pd.Timedelta(x, 'hour')
        end = start + pd.Timedelta(y, 'hour')
        # between() includes both endpoints by default
        m = df['demandTime'].between(start, end)
        out.append((d.date(), m.sum()))
    return pd.DataFrame(out, columns=["ds", "total_patient"])
print(group_by_duration(df, 6, 22))
           ds  total_patient
0  2023-06-06              6
1  2023-06-07              3
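The question asks for ds to run from the minimum to the maximum date, while the loop above only visits dates that actually occur in demandTime. Here is a minimal sketch of one way to cover every calendar day, iterating over pd.date_range instead. The function name group_by_duration_all_days and the demo data (with a deliberate gap on 2023-06-07) are made up for illustration, and it assumes demandTime has no missing values.

```python
import pandas as pd

def group_by_duration_all_days(df, x, y):
    """Count visits in [day + x hours, day + x + y hours] for every calendar day."""
    # Every day from the first to the last date in the data, even days with no rows
    days = pd.date_range(df["demandTime"].min().normalize(),
                         df["demandTime"].max().normalize(), freq="D")
    out = []
    for d in days:
        start = d + pd.Timedelta(hours=x)
        end = start + pd.Timedelta(hours=y)
        # between() is inclusive on both ends by default
        count = int(df["demandTime"].between(start, end).sum())
        out.append((d.date(), count))
    return pd.DataFrame(out, columns=["ds", "total_patient"])

# Hypothetical data: nothing is recorded on 2023-06-07
demo = pd.DataFrame({
    "patientVisit_id": [1, 2, 3],
    "demandTime": pd.to_datetime(
        ["2023-06-06 06:00:00", "2023-06-06 07:00:00", "2023-06-08 09:00:00"]),
})
result = group_by_duration_all_days(demo, 6, 22)
print(result)
```

With this demo data the day without any visits still gets a row, with a count of 0.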
If no overlapping is possible (each window stays within its own calendar day), the solution can be changed to:
def group_by_duration(df, x, y):
    # window for each row, anchored at midnight of that row's own day
    start = df['demandTime'].dt.normalize() + pd.Timedelta(x, 'hour')
    end = start + pd.Timedelta(y, 'hour')
    # assign() returns a copy, so the caller's df is not modified
    out = df.assign(ds=df['demandTime'].dt.date,
                    total_patient=df['demandTime'].between(start, end))
    return out.groupby('ds', as_index=False)["total_patient"].sum()
print(group_by_duration(df, 6, 5))
           ds  total_patient
0  2023-06-06              5
1  2023-06-07              0
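For larger data, the per-day loop from the first solution could be replaced by a lookup on sorted timestamps. This is a sketch of a different technique (binary search with np.searchsorted), not a restatement of the solutions above. The name group_by_duration_sorted is hypothetical, and it assumes demandTime contains no missing values. It handles overlapping windows, since each window is counted independently.

```python
import numpy as np
import pandas as pd

def group_by_duration_sorted(df, x, y):
    # Sorted timestamps allow counting each window with two binary searches
    times = np.sort(df["demandTime"].to_numpy())
    days = pd.date_range(df["demandTime"].min().normalize(),
                         df["demandTime"].max().normalize(), freq="D")
    start = (days + pd.Timedelta(hours=x)).to_numpy()
    end = (days + pd.Timedelta(hours=x + y)).to_numpy()
    # side="left"/"right" keeps both endpoints inclusive, like Series.between
    lo = np.searchsorted(times, start, side="left")
    hi = np.searchsorted(times, end, side="right")
    return pd.DataFrame({"ds": days.date, "total_patient": hi - lo})

# Same timestamps as the example in the question
df = pd.DataFrame({
    "patientVisit_id": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "demandTime": pd.to_datetime([
        "2023-06-06 06:00:00", "2023-06-06 07:00:00", "2023-06-06 08:00:00",
        "2023-06-06 09:00:00", "2023-06-06 10:00:00", "2023-06-07 02:00:00",
        "2023-06-07 12:00:00", "2023-06-07 13:00:00", "2023-06-07 14:00:00",
    ]),
})
result = group_by_duration_sorted(df, 6, 22)
print(result)
```

The subtraction hi - lo gives the number of timestamps falling inside each window, so no boolean mask over the whole column is built per day.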
Upvotes: 0