Reputation: 8628
I have around 500,000 entries in my dataframe, and I'd like to apply some kind of smoothing to this data in order to significantly decrease the number of entries.
This is a sample dataframe df:
EVENT GRADE SERVICE_TIME QUEUE_TIME HOLD_TIME
AAA 3 170 20 12
AAA 4 165 15 10
AAA 3 172 24 12
AAA 3 105 5 10
BBB 5 40 10 10
BBB 5 60 10 8
The objective is to decrease the granularity of SERVICE_TIME, QUEUE_TIME and HOLD_TIME.
To do this, I want to group entries by EVENT and GRADE, and to estimate average values of SERVICE_TIME for a slicing window of 60 seconds (i.e. [0;60], [61;120], [121;180], [181;240], and so on up to the maximum value of SERVICE_TIME). It's important that the slicing window's size can be flexibly changed from 60 to any other value.
So, the slicing should be applied to SERVICE_TIME, while the other columns, QUEUE_TIME and HOLD_TIME, should just be averaged accordingly.
A sample result:
EVENT GRADE SERVICE_TIME QUEUE_TIME HOLD_TIME
AAA 3 171 22 12
AAA 3 105 5 10
AAA 4 165 15 10
BBB 5 50 10 9
So, two values of SERVICE_TIME, 170 and 172, were merged because they both belong to the slicing window [121;180]. Therefore I calculated their average, 171, and then averaged the corresponding values of QUEUE_TIME and HOLD_TIME.
How can I do this slicing?
If I had to just calculate an average (without slicing), then I would do it this way.
result = df.groupby(['EVENT', 'GRADE']).agg({'SERVICE_TIME': 'mean',
                                             'QUEUE_TIME': 'mean',
                                             'HOLD_TIME': 'mean'}).reset_index()
Upvotes: 0
Views: 57
Reputation: 5877
You can generate a new column that assigns each row to a window (here 'SERVICE_TIME_GROUP'), and then group by this new column. After grouping, you can drop it.
import math

WINDOW = 60
# Assign each row to a window based on SERVICE_TIME
df['SERVICE_TIME_GROUP'] = df['SERVICE_TIME'].apply(lambda x: 1 if x == 0 else math.ceil(x / (WINDOW * 1.0)))
result = df.groupby(['EVENT', 'GRADE', 'SERVICE_TIME_GROUP']).agg({'SERVICE_TIME': 'mean',
                                                                   'QUEUE_TIME': 'mean',
                                                                   'HOLD_TIME': 'mean'}).reset_index()
result = result.drop('SERVICE_TIME_GROUP', axis=1)
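With around 500,000 rows, the row-wise apply above can be slow. A minimal vectorized sketch of the same assignment (an alternative I'm adding here, assuming numpy is available as np):

import numpy as np

# Same window assignment as the lambda above, computed on the whole column at once:
# 0 maps to group 1, everything else to ceil(SERVICE_TIME / WINDOW).
df['SERVICE_TIME_GROUP'] = np.where(df['SERVICE_TIME'] == 0,
                                    1,
                                    np.ceil(df['SERVICE_TIME'] / WINDOW)).astype(int)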
For [0;60], [61;120], [121;180], [181;240], the ceil function will work. One special case is 0, which needs to be handled separately. Otherwise, just divide by the window size and take the ceiling, and you get the group:
0 => 1
1-60 => 1
61-120 => 2
121-180 => 3
...
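As a quick sanity check of that mapping (a small sketch, not part of the original answer):

import math

WINDOW = 60
for x in [0, 1, 60, 61, 120, 121, 180]:
    group = 1 if x == 0 else math.ceil(x / WINDOW)
    print(x, '=>', group)  # 0 => 1, 1 => 1, 60 => 1, 61 => 2, 120 => 2, 121 => 3, 180 => 3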
Upvotes: 1
Reputation: 879103
You basically have the right idea, except that instead of grouping by just ['EVENT','GRADE'], you'll want to also group by SERVICE_GROUP:
df['SERVICE_GROUP'] = (df['SERVICE_TIME']-1)//60
import numpy as np
import pandas as pd

df = pd.DataFrame({'EVENT': ['AAA', 'AAA', 'AAA', 'AAA', 'BBB', 'BBB'],
                   'GRADE': [3, 4, 3, 3, 5, 5],
                   'HOLD_TIME': [12, 10, 12, 10, 10, 8],
                   'QUEUE_TIME': [20, 15, 24, 5, 10, 10],
                   'SERVICE_TIME': [170, 165, 172, 105, 40, 60]})

df['SERVICE_GROUP'] = (df['SERVICE_TIME']-1)//60
result = (df.groupby(['EVENT', 'GRADE', 'SERVICE_GROUP'])
            .agg({'SERVICE_TIME': 'mean',
                  'QUEUE_TIME': 'mean',
                  'HOLD_TIME': 'mean'})
            .reset_index())
result = result.drop('SERVICE_GROUP', axis=1)
print(result)
prints
EVENT GRADE QUEUE_TIME SERVICE_TIME HOLD_TIME
0 AAA 3 5 105 10
1 AAA 3 22 171 12
2 AAA 4 15 165 10
3 BBB 5 10 50 9
The -1 in (df['SERVICE_TIME']-1)//60 makes 60 part of SERVICE_GROUP 0. Otherwise, the SERVICE_TIMEs 40 and 60 would not have been grouped together. Thus df['SERVICE_GROUP'] = (df['SERVICE_TIME']-1)//60 causes the SERVICE_TIME groups to be the half-open intervals
(0, 60], (60, 120], (120, 180], ...
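Since the question asks for a window size that can be changed from 60 to any other value, the same formula can be parameterized with a window variable (a minimal sketch; WINDOW is just an illustrative name):

WINDOW = 60  # change to any other window size

# (x - 1) // WINDOW puts x into the half-open interval (k*WINDOW, (k+1)*WINDOW]
df['SERVICE_GROUP'] = (df['SERVICE_TIME'] - 1) // WINDOW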
Upvotes: 2