Dinosaurius
Dinosaurius

Reputation: 8628

How to apply slicing to the dataframe?

I have around 500,000 entries in my dataframe, and I'd like to apply some kind of smoothing to this data in order to significantly decrease the amount of entries.

This is a sample dataframe df:

EVENT   GRADE   SERVICE_TIME   QUEUE_TIME   HOLD_TIME
AAA     3       170            20           12
AAA     4       165            15           10
AAA     3       172            24           12
AAA     3       105            5            10
BBB     5       40             10           10
BBB     5       60             10           8

The objective is to decrease the granularity of SERVICE_TIME, QUEUE_TIME and HOLD_TIME.

To do this, I want to group entries by EVENT and GRADE, and to estimate average values of SERVICE_TIME for a slice window of 60 seconds (i.e. [0;60],[61;120],[121;180],[181;240], and so on till the maximum value of SERVICE_TIME). It's important that the slicing window's size can be flexibly changed from 60 to any other value. So, the slicing should be applied to SERVICE_TIME, while other columns QUEUE_TIME and QUEUE_TIME should be just averaged accordingly.

A sample result:

EVENT   GRADE   SERVICE_TIME   QUEUE_TIME   HOLD_TIME
AAA     3       171            22           12
AAA     3       105            5            10
AAA     4       165            15           10
BBB     5       50             10           9

So, two values of SERVICE_TIME, 170 and 172, were merged because they both belong to the slicing window [121;180]. Therefore I calculated an average 171 and then I just average the corressponding values of QUEUE_TIME and HOLD_TIME.

How can I do this slicing?

If I had to just calculate an average (without slicing), then I would do it this way.

result = df.groupby(['EVENT','GRADE']).agg({'SERVICE_TIME': 'mean', 
                                            'QUEUE_TIME': 'mean', 
                                            'HOLD_TIME': 'mean'}).reset_index() 

Upvotes: 0

Views: 57

Answers (2)

prem kumar
prem kumar

Reputation: 5877

You can generate a new column which divides based on windows(here 'SERVICE_TIME_GROUP'). And then group using this new column. After grouping you can drop it.

WINDOW = 60
df['SERVICE_TIME_GROUP'] = df['SERVICE_TIME'].apply(lambda x : 1 if x==0 else math.ceil(x/(WINDOW*1.0));

result = df.groupby(['EVENT','GRADE','SERVICE_TIME_GROUP']).agg({'SERVICE_TIME': 'mean', 
                                            'QUEUE_TIME': 'mean', 
                                            'HOLD_TIME': 'mean'}).reset_index() 

result = result.drop('SERVICE_TIME_GROUP', 1)

For [0;60],[61;120],[121;180],[181;240] , ceil function will work. One special case is 0 which needs to be handled separately. Otherwise just divide by window size and ceil it, you will get a group.
0 = > 1
1-60 => 1
61-120 => 2
121-180 => 3
...

Upvotes: 1

unutbu
unutbu

Reputation: 879103

You basically have the right idea, except that instead of grouping by just ['EVENT','GRADE'], you'll want to also groupby SERVICE_GROUP:

df['SERVICE_GROUP'] = (df['SERVICE_TIME']-1)//60

import numpy as np
import pandas as pd

df = pd.DataFrame({'EVENT': ['AAA', 'AAA', 'AAA', 'AAA', 'BBB', 'BBB'],
 'GRADE': [3, 4, 3, 3, 5, 5],
 'HOLD_TIME': [12, 10, 12, 10, 10, 8],
 'QUEUE_TIME': [20, 15, 24, 5, 10, 10],
 'SERVICE_TIME': [170, 165, 172, 105, 40, 60]})

df['SERVICE_GROUP'] = (df['SERVICE_TIME']-1)//60

result = (df.groupby(['EVENT','GRADE', 'SERVICE_GROUP'])
          .agg({'SERVICE_TIME': 'mean', 
                'QUEUE_TIME': 'mean', 
                'HOLD_TIME': 'mean'}).reset_index())
result = result.drop('SERVICE_GROUP', axis=1)
print(result)

prints

  EVENT  GRADE  QUEUE_TIME  SERVICE_TIME  HOLD_TIME
0   AAA      3           5           105         10
1   AAA      3          22           171         12
2   AAA      4          15           165         10
3   BBB      5          10            50          9

The -1 in (df['SERVICE_TIME']-1)//60 makes 60 part of SERVICE_GROUP 0. Otherwise, SERVICE_TIMES 40 and 60 would not have been grouped together.

Thus df['SERVICE_GROUP'] = (df['SERVICE_TIME']-1)//60 causes the SERVICE_TIME groups to be the half-open intervals

(1, 60], (60, 120], (120, 180], ...

Upvotes: 2

Related Questions