Reputation: 51582
I am trying to do a rolling sum of data frame. Sample of data frame:
cdateint severity cnt_alerts
0 20200511 1 48
1 20200511 2 89
2 20200511 3 5
3 20200511 4 1
4 20200512 1 48
5 20200512 2 69
6 20200512 3 1
7 20200512 4 0
8 20200513 1 47
9 20200513 2 62
10 20200513 3 1
11 20200513 4 1
12 20200514 1 50
13 20200514 2 44
14 20200514 3 2
15 20200514 4 4
16 20200515 1 48
17 20200515 2 44
18 20200515 3 6
19 20200515 4 5
20 20200516 1 47
21 20200516 2 23
22 20200516 3 0
23 20200516 4 1
24 20200517 1 48
25 20200517 2 14
26 20200517 3 1
27 20200517 4 0
So I want to roll, say 3 days at a time every 1 day and get the sum of cnt_alerts
per severity. This means that the first operation will be done for days (cdateint): [20200511, 20200512, 20200513] and the result should be:
The next operation will be for days (cdateint): [20200512, 20200513, 20200514]
And so on.
Note that for all days there will always be 4 severity levels and they are always sorted as you see on the example above.
UPDATE June 25 2020:
In addition I would like to get the cdateint
'labels'. i.e. for the first case, the label is 20200511 - 20200513
DATA
df = pd.DataFrame({'cdateint': {0: u'20200511', 1: u'20200511', 2: u'20200511', 3: u'20200511', 4: u'20200512', 5: u'20200512', 6: u'20200512', 7: u'20200512', 8: u'20200513', 9: u'20200513', 10: u'20200513', 11: u'20200513', 12: u'20200514', 13: u'20200514', 14: u'20200514', 15: u'20200514', 16: u'20200515', 17: u'20200515', 18: u'20200515', 19: u'20200515', 20: u'20200516', 21: u'20200516', 22: u'20200516', 23: u'20200516', 24: u'20200517', 25: u'20200517', 26: u'20200517', 27: u'20200517'},
'cnt_alerts': {0: 48, 1: 89, 2: 5, 3: 1, 4: 48, 5: 69, 6: 1, 7: 0, 8: 47, 9: 62, 10: 1, 11: 1, 12: 50, 13: 44, 14: 2, 15: 4, 16: 48, 17: 44, 18: 6, 19: 5, 20: 47, 21: 23, 22: 0, 23: 1, 24: 48, 25: 14, 26: 1, 27: 0},
'severity': {0: 1, 1: 2, 2: 3, 3: 4, 4: 1, 5: 2, 6: 3, 7: 4, 8: 1, 9: 2, 10: 3, 11: 4, 12: 1, 13: 2, 14: 3, 15: 4, 16: 1, 17: 2, 18: 3, 19: 4, 20: 1, 21: 2, 22: 3, 23: 4, 24: 1, 25: 2, 26: 3, 27: 4}})
Upvotes: 2
Views: 151
Reputation: 35115
I still lack the technical skills of inexperienced.
Created Flow:.
df['cdatedt'] = pd.to_datetime(df['cdateint'], format='%Y%m%d')
gb = df.set_index('cdatedt').groupby('severity').rolling('3d')['cnt_alerts'].sum().reset_index()
gb['cdateint'] = gb['cdatedt'].astype(str).str.replace('-','')
gb['label_flg'] = 1
gb['label_cnt'] = gb[['severity','label_flg']].groupby('severity').cumsum()
gb.loc[gb['label_cnt'] > 3,'label_cnt'] = 3
gb['label'] = ''
for i,c in gb['label_cnt'].iteritems():
if c == 1:
gb.loc[i,'label'] = gb.loc[i,'cdateint']
elif c == 2:
gb.loc[i,'label'] = gb.loc[i-1,'cdateint'] + '-' + gb.loc[i,'cdateint']
elif c == 3:
gb.loc[i,'label'] = gb.loc[i-2,'cdateint'] + '-' + gb.loc[i,'cdateint']
gb.drop(['label_flg', 'label_cnt'], axis=1, inplace=True)
gb
severity cdatedt cnt_alerts cdateint label
0 1 2020-05-11 48.0 20200511 20200511
1 1 2020-05-12 96.0 20200512 20200511-20200512
2 1 2020-05-13 143.0 20200513 20200511-20200513
3 1 2020-05-14 145.0 20200514 20200512-20200514
4 1 2020-05-15 145.0 20200515 20200513-20200515
5 1 2020-05-16 145.0 20200516 20200514-20200516
6 1 2020-05-17 143.0 20200517 20200515-20200517
7 2 2020-05-11 89.0 20200511 20200511
8 2 2020-05-12 158.0 20200512 20200511-20200512
9 2 2020-05-13 220.0 20200513 20200511-20200513
10 2 2020-05-14 175.0 20200514 20200512-20200514
11 2 2020-05-15 150.0 20200515 20200513-20200515
12 2 2020-05-16 111.0 20200516 20200514-20200516
13 2 2020-05-17 81.0 20200517 20200515-20200517
14 3 2020-05-11 5.0 20200511 20200511
15 3 2020-05-12 6.0 20200512 20200511-20200512
16 3 2020-05-13 7.0 20200513 20200511-20200513
17 3 2020-05-14 4.0 20200514 20200512-20200514
18 3 2020-05-15 9.0 20200515 20200513-20200515
19 3 2020-05-16 8.0 20200516 20200514-20200516
20 3 2020-05-17 7.0 20200517 20200515-20200517
21 4 2020-05-11 1.0 20200511 20200511
22 4 2020-05-12 1.0 20200512 20200511-20200512
23 4 2020-05-13 2.0 20200513 20200511-20200513
24 4 2020-05-14 5.0 20200514 20200512-20200514
25 4 2020-05-15 10.0 20200515 20200513-20200515
26 4 2020-05-16 10.0 20200516 20200514-20200516
27 4 2020-05-17 6.0 20200517 20200515-20200517
Upvotes: 1
Reputation: 29635
Assuming your data is well structured, you have each day and each severity for each day, and well ordered, you can avoid groupby
. Also if efficiency is your concern, I guess going to numpy would be interesting. here is one way
#parameters
nb_days = 3 #your rule in days
nb_sev = 4 #or df['severity'].nunique() if you don't know
#create an array with each column a severity
arr = df['cnt_alerts'].to_numpy().reshape(-1, nb_sev)
#cumsum this along the rows
arr = np.cumsum(arr, axis=0)
# you need to remove the values depending on the nb of days to get the sum of only X days
arr[nb_days:, :] -= arr[:-nb_days, :]
#create the column
df[f'cnt_alerts_{nb_days}D'] = arr.reshape(-1)
# create cdateint as wanted using shift to align the start date with the last date
df[f'cdateint_{nb_days}D'] = (df['cdateint'].shift(nb_sev*(nb_days-1)).bfill()
.str.cat(df['cdateint'], sep='-'))
and you get
print (df.head(20))
cdateint cnt_alerts severity cnt_alerts_3D cdateint_3D
0 20200511 48 1 48 20200511-20200511
1 20200511 89 2 89 20200511-20200511
2 20200511 5 3 5 20200511-20200511
3 20200511 1 4 1 20200511-20200511
4 20200512 48 1 96 20200511-20200512
5 20200512 69 2 158 20200511-20200512
6 20200512 1 3 6 20200511-20200512
7 20200512 0 4 1 20200511-20200512
8 20200513 47 1 143 20200511-20200513
9 20200513 62 2 220 20200511-20200513
10 20200513 1 3 7 20200511-20200513
11 20200513 1 4 2 20200511-20200513
12 20200514 50 1 145 20200512-20200514
13 20200514 44 2 175 20200512-20200514
14 20200514 2 3 4 20200512-20200514
15 20200514 4 4 5 20200512-20200514
16 20200515 48 1 145 20200513-20200515
17 20200515 44 2 150 20200513-20200515
18 20200515 6 3 9 20200513-20200515
19 20200515 5 4 10 20200513-20200515
If you don't care of the first days that does not meet the number of days criteria (here it is the first 2 days), then you can go with np.lib.stride_tricks.as_strided
, here a good post to understand what it is.
#create an array
arr = df['cnt_alerts'].to_numpy()
# get some parameters
nb_nan = (nb_days-1)*nb_sev
arr_is = arr.itemsize
# assign the column
df.loc[nb_nan:, f'cnt_alerts_strides'] = (
np.lib.stride_tricks.as_strided(arr, shape=(len(arr)-nb_nan,nb_days),
strides=(arr_is, arr_is*nb_sev))
.sum(axis=1)
)
print(df.head(20))
cdateint cnt_alerts severity cnt_alerts_strides
0 20200511 48 1 NaN
1 20200511 89 2 NaN
2 20200511 5 3 NaN
3 20200511 1 4 NaN
4 20200512 48 1 NaN
5 20200512 69 2 NaN
6 20200512 1 3 NaN
7 20200512 0 4 NaN
8 20200513 47 1 143.0
9 20200513 62 2 220.0
10 20200513 1 3 7.0
11 20200513 1 4 2.0
12 20200514 50 1 145.0
13 20200514 44 2 175.0
14 20200514 2 3 4.0
15 20200514 4 4 5.0
16 20200515 48 1 145.0
17 20200515 44 2 150.0
18 20200515 6 3 9.0
19 20200515 5 4 10.0
Note: with this method, I'm not sure how to do it for the column cdateint.
Upvotes: 2
Reputation: 29732
IIUC, use pandas.to_datetime
and groupby
with rolling
:
df["cdateint"] = pd.to_datetime(df["cdateint"].astype(str))
new_df = df.set_index("cdateint").groupby("severity").rolling("3d")["cnt_alerts"].sum()
new_df = new_df.reset_index()
print(new_df)
Output:
severity cdateint cnt_alerts
0 1 2020-05-11 48.0
1 1 2020-05-12 96.0
2 1 2020-05-13 143.0
3 1 2020-05-14 145.0
4 1 2020-05-15 145.0
5 1 2020-05-16 145.0
6 1 2020-05-17 143.0
7 2 2020-05-11 89.0
8 2 2020-05-12 158.0
9 2 2020-05-13 220.0
10 2 2020-05-14 175.0
11 2 2020-05-15 150.0
12 2 2020-05-16 111.0
13 2 2020-05-17 81.0
14 3 2020-05-11 5.0
15 3 2020-05-12 6.0
16 3 2020-05-13 7.0
17 3 2020-05-14 4.0
18 3 2020-05-15 9.0
19 3 2020-05-16 8.0
20 3 2020-05-17 7.0
21 4 2020-05-11 1.0
22 4 2020-05-12 1.0
23 4 2020-05-13 2.0
24 4 2020-05-14 5.0
25 4 2020-05-15 10.0
26 4 2020-05-16 10.0
27 4 2020-05-17 6.0
Upvotes: 2