Reputation: 3028
Assume that I have this dataframe.
import pandas as pd
data = {"Date_Time": ["2018-08-14 02:00:00", "2018-08-14 01:00:00", "2018-08-14 00:00:00", "2018-08-13 23:00:00", "2018-08-13 22:00:00", "2018-08-13 21:00:00", "2018-08-13 20:00:00"],
"Date": ["2018-08-14", "2018-08-14", "2018-08-14", "2018-08-13", "2018-08-13", "2018-08-13", "2018-08-13"],
"Duration":[3, 0, 0, 10, 2, 20, 1],
"value":[4, 3, 2, 1, 0, 23, 22],
"Remark":["e", "o", "k", "x", "c", "z", "a"]
}
df = pd.DataFrame.from_dict(data)
df
Date_Time Date Duration value Remark
0 2018-08-14 02:00:00 2018-08-14 3 4 e
1 2018-08-14 01:00:00 2018-08-14 0 3 o
2 2018-08-14 00:00:00 2018-08-14 0 2 k
3 2018-08-13 23:00:00 2018-08-13 10 1 x
4 2018-08-13 22:00:00 2018-08-13 2 0 c
5 2018-08-13 21:00:00 2018-08-13 20 23 z
6 2018-08-13 20:00:00 2018-08-13 1 22 a
I'd like to make the cumulative summation for each day. However, I want my day to start at 22:00 each day. So, the "value" column is the reference to hour of that day.
Or I can do
from datetime import timedelta
two_hours = pd.Timedelta(hours=2)
df["dummy_date"] = df["Date_Time"] + two_hours
df
Date_Time Date Duration value Remark dummy_date
0 2018-08-14 02:00:00 2018-08-14 3 4 e 2018-08-14 04:00:00
1 2018-08-14 01:00:00 2018-08-14 0 3 o 2018-08-14 03:00:00
2 2018-08-14 00:00:00 2018-08-14 0 2 k 2018-08-14 02:00:00
3 2018-08-13 23:00:00 2018-08-13 10 1 x 2018-08-14 01:00:00
4 2018-08-13 22:00:00 2018-08-13 2 0 c 2018-08-14 00:00:00
5 2018-08-13 21:00:00 2018-08-13 20 23 z 2018-08-13 23:00:00
6 2018-08-13 20:00:00 2018-08-13 1 22 a 2018-08-13 22:00:00
If that helps you to see how I count a day as shown in "dummy_date".
However, what I really want is I want to do cumulative summation on "Duration" column based on daily basis. So, the cumsum will start over every day when "value" is 0 or at mid-night of the "dummy_date" column. Below is the desired out put that I'm looking to achieve.
Date_Time Date Duration value Remark cum_sum dummy_date
0 2018-08-14 02:00:00 2018-08-14 3 4 e 15 2018-08-14 04:00:00
1 2018-08-14 01:00:00 2018-08-14 0 3 o 12 2018-08-14 03:00:00
2 2018-08-14 00:00:00 2018-08-14 0 2 k 12 2018-08-14 02:00:00
3 2018-08-13 23:00:00 2018-08-13 10 1 x 12 2018-08-14 01:00:00
4 2018-08-13 22:00:00 2018-08-13 2 0 c 2 2018-08-14 00:00:00
5 2018-08-13 21:00:00 2018-08-13 20 23 z 21 2018-08-13 23:00:00
6 2018-08-13 20:00:00 2018-08-13 1 22 a 1 2018-08-13 22:00:00
cum_sum of the first row is the sum of duration from the forth row upward, 2 + 10 + 0 + 0 + 3 = 15.
The second row is 2 + 10 + 0 + 0 = 12.
The third row is 2 + 10 + 0 = 12.
The third row is 2 + 10 = 12.
The forth row is 2 = 2.
The fifth row is the sum of the sixth row and the fifth row, 1 + 20 = 21 because the "value" is 23 which is considered to be another day or it can be seen from "dummy_date".
The sixth row is the 1.
My attempt was to use the rolling from pandas. However, I don't know how to set it to start over when it begins a new day.
df["cum_sum"] = df["Duration"].rolling(24, min_periods=1).sum()
df
Upvotes: 7
Views: 12511
Reputation: 863611
I believe need change order by iloc[::-1]
or sort_values
, then groupby
by dates
from dummy_date
column or Series
with DataFrameGroupBy.cumsum
:
df["cum_sum"] = df["Duration"].iloc[::-1].groupby(df['dummy_date'].dt.date).cumsum()
print (df)
Date_Time Date Duration value Remark dummy_date \
0 2018-08-14 02:00:00 2018-08-14 3 4 e 2018-08-14 04:00:00
1 2018-08-14 01:00:00 2018-08-14 0 3 o 2018-08-14 03:00:00
2 2018-08-14 00:00:00 2018-08-14 0 2 k 2018-08-14 02:00:00
3 2018-08-13 23:00:00 2018-08-13 10 1 x 2018-08-14 01:00:00
4 2018-08-13 22:00:00 2018-08-13 2 0 c 2018-08-14 00:00:00
5 2018-08-13 21:00:00 2018-08-13 20 23 z 2018-08-13 23:00:00
6 2018-08-13 20:00:00 2018-08-13 1 22 a 2018-08-13 22:00:00
cum_sum
0 15
1 12
2 12
3 12
4 2
5 21
6 1
If dont need dummy_date
in output:
two_hours = pd.Timedelta(hours=2)
dates = (df["Date_Time"] + two_hours).dt.date
df["cum_sum"] = df["Duration"].iloc[::-1].groupby(dates).cumsum()
print (df)
Date_Time Date Duration value Remark cum_sum
0 2018-08-14 02:00:00 2018-08-14 3 4 e 15
1 2018-08-14 01:00:00 2018-08-14 0 3 o 12
2 2018-08-14 00:00:00 2018-08-14 0 2 k 12
3 2018-08-13 23:00:00 2018-08-13 10 1 x 12
4 2018-08-13 22:00:00 2018-08-13 2 0 c 2
5 2018-08-13 21:00:00 2018-08-13 20 23 z 21
6 2018-08-13 20:00:00 2018-08-13 1 22 a 1
df = df.sort_values('Date_Time')
two_hours = pd.Timedelta(hours=2)
dates = (df["Date_Time"] + two_hours).dt.date
df["cum_sum"] = df["Duration"].groupby(dates).cumsum()
df = df.sort_index()
print (df)
Date_Time Date Duration value Remark cum_sum
0 2018-08-14 02:00:00 2018-08-14 3 4 e 15
1 2018-08-14 01:00:00 2018-08-14 0 3 o 12
2 2018-08-14 00:00:00 2018-08-14 0 2 k 12
3 2018-08-13 23:00:00 2018-08-13 10 1 x 12
4 2018-08-13 22:00:00 2018-08-13 2 0 c 2
5 2018-08-13 21:00:00 2018-08-13 20 23 z 21
6 2018-08-13 20:00:00 2018-08-13 1 22 a 1
Upvotes: 8