Reputation: 8628
I have around 500,000 entries in my dataframe, and I'd like to apply some kind of smoothing to this data in order to significantly decrease the number of entries.
This is a sample dataframe df:
EVENT GRADE SERVICE_TIME QUEUE_TIME HOLD_TIME
AAA 3 170 20 12
AAA 4 165 15 10
AAA 3 172 24 12
AAA 3 105 5 10
BBB 5 40 10 10
BBB 5 60 10 8
The objective is to decrease the granularity of SERVICE_TIME, QUEUE_TIME and HOLD_TIME.
To do this, I want to group entries by EVENT and GRADE, and to estimate average values of SERVICE_TIME for a slicing window of 60 seconds (i.e. [0;60], [61;120], [121;180], [181;240], and so on up to the maximum value of SERVICE_TIME). It's important that the slicing window's size can be flexibly changed from 60 to any other value.
So, the slicing should be applied to SERVICE_TIME, while the other columns, QUEUE_TIME and HOLD_TIME, should just be averaged accordingly.
A sample result:
EVENT GRADE SERVICE_TIME QUEUE_TIME HOLD_TIME
AAA 3 171 22 12
AAA 3 105 5 10
AAA 4 165 15 10
BBB 5 50 10 9
So, two values of SERVICE_TIME, 170 and 172, were merged because they both belong to the slicing window [121;180]. Therefore I calculated their average, 171, and then averaged the corresponding values of QUEUE_TIME and HOLD_TIME.
How can I do this slicing?
If I had to just calculate an average (without slicing), then I would do it this way.
result = df.groupby(['EVENT', 'GRADE']).agg({'SERVICE_TIME': 'mean',
                                             'QUEUE_TIME': 'mean',
                                             'HOLD_TIME': 'mean'}).reset_index()
Upvotes: 0
Views: 57
Reputation: 5877
You can generate a new column that assigns each row to a window (here 'SERVICE_TIME_GROUP'), and then group by this new column. After grouping, you can drop it.
import math

WINDOW = 60
# Assign each row to a window based on SERVICE_TIME
df['SERVICE_TIME_GROUP'] = df['SERVICE_TIME'].apply(lambda x: 1 if x == 0 else math.ceil(x / (WINDOW * 1.0)))
result = df.groupby(['EVENT', 'GRADE', 'SERVICE_TIME_GROUP']).agg({'SERVICE_TIME': 'mean',
                                                                   'QUEUE_TIME': 'mean',
                                                                   'HOLD_TIME': 'mean'}).reset_index()
result = result.drop('SERVICE_TIME_GROUP', axis=1)
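With around 500,000 rows, the row-wise apply above can be slow. A minimal vectorized sketch of the same assignment (an alternative I'm adding here, assuming numpy is available as np):

import numpy as np

# Same window assignment as the lambda above, computed on the whole column at once:
# 0 maps to group 1, everything else to ceil(SERVICE_TIME / WINDOW).
df['SERVICE_TIME_GROUP'] = np.where(df['SERVICE_TIME'] == 0,
                                    1,
                                    np.ceil(df['SERVICE_TIME'] / WINDOW)).astype(int)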
For [0;60], [61;120], [121;180], [181;240], the ceil function will work. One special case is 0, which needs to be handled separately. Otherwise, just divide by the window size and take the ceiling, and you get the group:
0 => 1
1-60 => 1
61-120 => 2
121-180 => 3
...
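As a quick sanity check of that mapping (a small sketch, not part of the original answer):

import math

WINDOW = 60
for x in [0, 1, 60, 61, 120, 121, 180]:
    group = 1 if x == 0 else math.ceil(x / WINDOW)
    print(x, '=>', group)  # 0 => 1, 1 => 1, 60 => 1, 61 => 2, 120 => 2, 121 => 3, 180 => 3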
Upvotes: 1
Reputation: 879103
You basically have the right idea, except that instead of grouping by just ['EVENT','GRADE'], you'll want to also group by SERVICE_GROUP:
df['SERVICE_GROUP'] = (df['SERVICE_TIME']-1)//60
import numpy as np
import pandas as pd

df = pd.DataFrame({'EVENT': ['AAA', 'AAA', 'AAA', 'AAA', 'BBB', 'BBB'],
                   'GRADE': [3, 4, 3, 3, 5, 5],
                   'HOLD_TIME': [12, 10, 12, 10, 10, 8],
                   'QUEUE_TIME': [20, 15, 24, 5, 10, 10],
                   'SERVICE_TIME': [170, 165, 172, 105, 40, 60]})

df['SERVICE_GROUP'] = (df['SERVICE_TIME']-1)//60
result = (df.groupby(['EVENT', 'GRADE', 'SERVICE_GROUP'])
            .agg({'SERVICE_TIME': 'mean',
                  'QUEUE_TIME': 'mean',
                  'HOLD_TIME': 'mean'})
            .reset_index())
result = result.drop('SERVICE_GROUP', axis=1)
print(result)
prints
EVENT GRADE QUEUE_TIME SERVICE_TIME HOLD_TIME
0 AAA 3 5 105 10
1 AAA 3 22 171 12
2 AAA 4 15 165 10
3 BBB 5 10 50 9
The -1 in (df['SERVICE_TIME']-1)//60 makes 60 part of SERVICE_GROUP 0. Otherwise, the SERVICE_TIMEs 40 and 60 would not have been grouped together. Thus df['SERVICE_GROUP'] = (df['SERVICE_TIME']-1)//60 causes the SERVICE_TIME groups to be the half-open intervals
(0, 60], (60, 120], (120, 180], ...
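Since the question asks for a window size that can be changed from 60 to any other value, the same formula can be parameterized with a window variable (a minimal sketch; WINDOW is just an illustrative name):

WINDOW = 60  # change to any other window size

# (x - 1) // WINDOW puts x into the half-open interval (k*WINDOW, (k+1)*WINDOW]
df['SERVICE_GROUP'] = (df['SERVICE_TIME'] - 1) // WINDOW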
Upvotes: 2