Sotos

Reputation: 51582

Rolling operations of grouped data frame

I am trying to do a rolling sum over a grouped data frame. Here is a sample of the data frame:

     cdateint  severity  cnt_alerts
0    20200511         1          48
1    20200511         2          89
2    20200511         3           5
3    20200511         4           1
4    20200512         1          48
5    20200512         2          69
6    20200512         3           1
7    20200512         4           0
8    20200513         1          47
9    20200513         2          62
10   20200513         3           1
11   20200513         4           1
12   20200514         1          50
13   20200514         2          44
14   20200514         3           2
15   20200514         4           4
16   20200515         1          48
17   20200515         2          44
18   20200515         3           6
19   20200515         4           5
20   20200516         1          47
21   20200516         2          23
22   20200516         3           0
23   20200516         4           1
24   20200517         1          48
25   20200517         2          14
26   20200517         3           1
27   20200517         4           0

So I want to roll a window of, say, 3 days, moving 1 day at a time, and get the sum of cnt_alerts per severity. This means that the first operation will be done for days (cdateint) [20200511, 20200512, 20200513] and the result should be:

  1. For severity 1: [48 + 48 + 47]
  2. For severity 2: [89 + 69 + 62]
  3. For severity 3: [5 + 1 + 1]
  4. For severity 4: [1 + 0 + 1]

The next operation will be for days (cdateint): [20200512, 20200513, 20200514]

  1. For severity 1: [48 + 47 + 50]
  2. For severity 2: [69 + 62 + 44]
  3. For severity 3: [1 + 1 + 2]
  4. For severity 4: [0 + 1 + 4]

And so on.

Note that for every day there will always be 4 severity levels and they are always sorted as you see in the example above.


UPDATE June 25 2020:

In addition, I would like to get the cdateint 'labels', i.e. for the first case the label is 20200511 - 20200513.

DATA

df = pd.DataFrame({'cdateint': {0: u'20200511', 1: u'20200511', 2: u'20200511', 3: u'20200511', 4: u'20200512', 5: u'20200512', 6: u'20200512', 7: u'20200512', 8: u'20200513', 9: u'20200513', 10: u'20200513', 11: u'20200513', 12: u'20200514', 13: u'20200514', 14: u'20200514', 15: u'20200514', 16: u'20200515', 17: u'20200515', 18: u'20200515', 19: u'20200515', 20: u'20200516', 21: u'20200516', 22: u'20200516', 23: u'20200516', 24: u'20200517', 25: u'20200517', 26: u'20200517', 27: u'20200517'}, 
                   'cnt_alerts': {0: 48, 1: 89, 2: 5, 3: 1, 4: 48, 5: 69, 6: 1, 7: 0, 8: 47, 9: 62, 10: 1, 11: 1, 12: 50, 13: 44, 14: 2, 15: 4, 16: 48, 17: 44, 18: 6, 19: 5, 20: 47, 21: 23, 22: 0, 23: 1, 24: 48, 25: 14, 26: 1, 27: 0}, 
                   'severity': {0: 1, 1: 2, 2: 3, 3: 4, 4: 1, 5: 2, 6: 3, 7: 4, 8: 1, 9: 2, 10: 3, 11: 4, 12: 1, 13: 2, 14: 3, 15: 4, 16: 1, 17: 2, 18: 3, 19: 4, 20: 1, 21: 2, 22: 3, 23: 4, 24: 1, 25: 2, 26: 3, 27: 4}})
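
For example, the expected sums for the first window can be checked directly from this data frame (a quick sketch, assuming pandas is imported as pd):

first_window = df[df['cdateint'].isin(['20200511', '20200512', '20200513'])]
print(first_window.groupby('severity')['cnt_alerts'].sum())
# severity
# 1    143
# 2    220
# 3      7
# 4      2
# Name: cnt_alerts, dtype: int64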

Upvotes: 2

Views: 151

Answers (3)

r-beginners

Reputation: 35115

I am still inexperienced, so my technical skills are limited, but here is the flow I created:

  1. Compute the rolling 3-day sum per severity with `groupby` and `rolling`.
  2. Add a flag column for the label.
  3. Calculate the cumulative sum of that flag column per severity.
  4. Cap the values of that cumulative-sum column at 3.
  5. Build the label column with logic that depends on the value of the cumulative-sum column.

df['cdatedt'] = pd.to_datetime(df['cdateint'], format='%Y%m%d')
# rolling 3-day sum of cnt_alerts within each severity
gb = df.set_index('cdatedt').groupby('severity').rolling('3d')['cnt_alerts'].sum().reset_index()
gb['cdateint'] = gb['cdatedt'].astype(str).str.replace('-', '')
gb['label_flg'] = 1
# cumulative count of rows per severity, capped at the window length (3)
gb['label_cnt'] = gb.groupby('severity')['label_flg'].cumsum()
gb.loc[gb['label_cnt'] > 3, 'label_cnt'] = 3
gb['label'] = ''
for i, c in gb['label_cnt'].items():
    if c == 1:
        gb.loc[i, 'label'] = gb.loc[i, 'cdateint']
    elif c == 2:
        gb.loc[i, 'label'] = gb.loc[i-1, 'cdateint'] + '-' + gb.loc[i, 'cdateint']
    elif c == 3:
        gb.loc[i, 'label'] = gb.loc[i-2, 'cdateint'] + '-' + gb.loc[i, 'cdateint']
gb.drop(['label_flg', 'label_cnt'], axis=1, inplace=True)
gb
    severity    cdatedt  cnt_alerts  cdateint              label
0          1 2020-05-11        48.0  20200511           20200511
1          1 2020-05-12        96.0  20200512  20200511-20200512
2          1 2020-05-13       143.0  20200513  20200511-20200513
3          1 2020-05-14       145.0  20200514  20200512-20200514
4          1 2020-05-15       145.0  20200515  20200513-20200515
5          1 2020-05-16       145.0  20200516  20200514-20200516
6          1 2020-05-17       143.0  20200517  20200515-20200517
7          2 2020-05-11        89.0  20200511           20200511
8          2 2020-05-12       158.0  20200512  20200511-20200512
9          2 2020-05-13       220.0  20200513  20200511-20200513
10         2 2020-05-14       175.0  20200514  20200512-20200514
11         2 2020-05-15       150.0  20200515  20200513-20200515
12         2 2020-05-16       111.0  20200516  20200514-20200516
13         2 2020-05-17        81.0  20200517  20200515-20200517
14         3 2020-05-11         5.0  20200511           20200511
15         3 2020-05-12         6.0  20200512  20200511-20200512
16         3 2020-05-13         7.0  20200513  20200511-20200513
17         3 2020-05-14         4.0  20200514  20200512-20200514
18         3 2020-05-15         9.0  20200515  20200513-20200515
19         3 2020-05-16         8.0  20200516  20200514-20200516
20         3 2020-05-17         7.0  20200517  20200515-20200517
21         4 2020-05-11         1.0  20200511           20200511
22         4 2020-05-12         1.0  20200512  20200511-20200512
23         4 2020-05-13         2.0  20200513  20200511-20200513
24         4 2020-05-14         5.0  20200514  20200512-20200514
25         4 2020-05-15        10.0  20200515  20200513-20200515
26         4 2020-05-16        10.0  20200516  20200514-20200516
27         4 2020-05-17         6.0  20200517  20200515-20200517

Upvotes: 1

Ben.T

Reputation: 29635

Assuming your data is well structured and well ordered (every severity is present for every day, in the order shown), you can avoid groupby. Also, if efficiency is a concern, going to numpy could be interesting. Here is one way:

import numpy as np

# parameters
nb_days = 3  # your rule in days
nb_sev = 4   # or df['severity'].nunique() if you don't know

# create an array with one column per severity
arr = df['cnt_alerts'].to_numpy().reshape(-1, nb_sev)
# cumulative sum along the rows
arr = np.cumsum(arr, axis=0)
# subtract the cumsum from nb_days rows earlier so each row keeps only the last nb_days days
arr[nb_days:, :] -= arr[:-nb_days, :]
# create the column
df[f'cnt_alerts_{nb_days}D'] = arr.reshape(-1)
# create cdateint as wanted, using shift to align the start date with the last date
df[f'cdateint_{nb_days}D'] = (df['cdateint'].shift(nb_sev*(nb_days-1)).bfill()
                                 .str.cat(df['cdateint'], sep='-'))

and you get

print (df.head(20))
    cdateint  cnt_alerts  severity  cnt_alerts_3D        cdateint_3D
0   20200511          48         1             48  20200511-20200511
1   20200511          89         2             89  20200511-20200511
2   20200511           5         3              5  20200511-20200511
3   20200511           1         4              1  20200511-20200511
4   20200512          48         1             96  20200511-20200512
5   20200512          69         2            158  20200511-20200512
6   20200512           1         3              6  20200511-20200512
7   20200512           0         4              1  20200511-20200512
8   20200513          47         1            143  20200511-20200513
9   20200513          62         2            220  20200511-20200513
10  20200513           1         3              7  20200511-20200513
11  20200513           1         4              2  20200511-20200513
12  20200514          50         1            145  20200512-20200514
13  20200514          44         2            175  20200512-20200514
14  20200514           2         3              4  20200512-20200514
15  20200514           4         4              5  20200512-20200514
16  20200515          48         1            145  20200513-20200515
17  20200515          44         2            150  20200513-20200515
18  20200515           6         3              9  20200513-20200515
19  20200515           5         4             10  20200513-20200515
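
As a quick cross-check, the same numbers should come out of a plain per-severity pandas rolling sum (a sketch assuming the df and nb_days defined above):

# compare the numpy cumsum trick with a per-group pandas rolling sum
pd_check = df.groupby('severity')['cnt_alerts'].transform(
    lambda s: s.rolling(nb_days, min_periods=1).sum())
assert (pd_check == df[f'cnt_alerts_{nb_days}D']).all()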

If you don't care about the first days that do not meet the number-of-days criterion (here, the first 2 days), then you can go with np.lib.stride_tricks.as_strided (it is worth reading a good post on strides to understand what it does).

#create an array
arr = df['cnt_alerts'].to_numpy()
# get some parameters
nb_nan = (nb_days-1)*nb_sev
arr_is = arr.itemsize
# assign the column
df.loc[nb_nan:, 'cnt_alerts_strides'] = (
    np.lib.stride_tricks.as_strided(arr, shape=(len(arr)-nb_nan,nb_days), 
                                    strides=(arr_is, arr_is*nb_sev))
      .sum(axis=1)
    )
print(df.head(20))
    cdateint  cnt_alerts  severity  cnt_alerts_strides
0   20200511          48         1                 NaN
1   20200511          89         2                 NaN
2   20200511           5         3                 NaN
3   20200511           1         4                 NaN
4   20200512          48         1                 NaN
5   20200512          69         2                 NaN
6   20200512           1         3                 NaN
7   20200512           0         4                 NaN
8   20200513          47         1               143.0
9   20200513          62         2               220.0
10  20200513           1         3                 7.0
11  20200513           1         4                 2.0
12  20200514          50         1               145.0
13  20200514          44         2               175.0
14  20200514           2         3                 4.0
15  20200514           4         4                 5.0
16  20200515          48         1               145.0
17  20200515          44         2               150.0
18  20200515           6         3                 9.0
19  20200515           5         4                10.0

Note: with this method, I'm not sure how to do it for the column cdateint.
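
One possible sketch for that, reusing the shift idea from the first part (the column name cdateint_strides is just illustrative): the start date of a full window sits nb_nan rows earlier, and rows without a full window stay NaN.

# pair each row's date with the date nb_nan rows earlier; partial windows become NaN
df['cdateint_strides'] = df['cdateint'].shift(nb_nan).str.cat(df['cdateint'], sep='-')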

Upvotes: 2

Chris

Reputation: 29732

IIUC, use pandas.to_datetime and groupby with rolling:

df["cdateint"] = pd.to_datetime(df["cdateint"].astype(str))
new_df = df.set_index("cdateint").groupby("severity").rolling("3d")["cnt_alerts"].sum()
new_df = new_df.reset_index()
print(new_df)

Output:

    severity   cdateint  cnt_alerts
0          1 2020-05-11        48.0
1          1 2020-05-12        96.0
2          1 2020-05-13       143.0
3          1 2020-05-14       145.0
4          1 2020-05-15       145.0
5          1 2020-05-16       145.0
6          1 2020-05-17       143.0
7          2 2020-05-11        89.0
8          2 2020-05-12       158.0
9          2 2020-05-13       220.0
10         2 2020-05-14       175.0
11         2 2020-05-15       150.0
12         2 2020-05-16       111.0
13         2 2020-05-17        81.0
14         3 2020-05-11         5.0
15         3 2020-05-12         6.0
16         3 2020-05-13         7.0
17         3 2020-05-14         4.0
18         3 2020-05-15         9.0
19         3 2020-05-16         8.0
20         3 2020-05-17         7.0
21         4 2020-05-11         1.0
22         4 2020-05-12         1.0
23         4 2020-05-13         2.0
24         4 2020-05-14         5.0
25         4 2020-05-15        10.0
26         4 2020-05-16        10.0
27         4 2020-05-17         6.0
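
If you also need the date-range labels from the update, here is one possible sketch on top of new_df (the shift-and-fill choice and the label column name are assumptions, not part of the answer above):

# start of each 3-day window per severity: the date two rows back, clipped at the group's first date
start = new_df.groupby("severity")["cdateint"].transform(lambda s: s.shift(2).fillna(s.iloc[0]))
new_df["label"] = start.dt.strftime("%Y%m%d") + "-" + new_df["cdateint"].dt.strftime("%Y%m%d")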

Upvotes: 2
