Dinosaurius
Dinosaurius

Reputation: 8628

How to group data by ranges?

I have the following pandas dataframe (it is just a small extract):

GROUP   AVG_PERCENT_EVAL_1  AVG_PERCENT_NEGATIVE    AVG_TOTAL_WAIT_TIME AVG_TOTAL_SERVICE_TIME
AAAAA   19                  11.000000               25.000000           163.000000
AAAAA   22                  2.000000                146.364198          332.761317
AAAAA   23                  0.500000                44.068225           302.708639
AAAAA   24                  1.000000                122.672215          322.359795
AAAAA   26                  1.000000                143.594896          317.940989
BBBBB   18                  1.000000                121.225692          319.292226
BBBBB   19                  1.000000                40.054707           201.096152
BBBBB   21                  0.333333                29.221458           207.142059
BBBBB   27                  2.000000                103.796290          313.685358

I need to make AVG_PERCENT_EVAL_1 more continuous, which means that, instead of exact values e.g. 18, 19, 20, etc., I want to put ranges e.g. 18-20,21-23, and so on till around 40.

The data should be grouped by GROUP and then AVG_PERCENT_NEGATIVE, AVG_TOTAL_WAIT_TIME and AVG_TOTAL_SERVICE_TIME should be averaged for each corresponding range.

IMPORTANT: Let's take the range 18-20. The group AAA has corresponding entry with AVG_PERCENT_EVAL_1 equal to 19, while the group BBBBB has two entries falling in this range - 18 and 19. The values of AVG_PERCENT_NEGATIVE, AVG_TOTAL_WAIT_TIME and AVG_TOTAL_SERVICE_TIME should be these ones:

GROUP   AVG_PERCENT_RANGE   AVG_PERCENT_NEGATIVE    AVG_TOTAL_WAIT_TIME AVG_TOTAL_SERVICE_TIME
AAAAA   18-20               11.00                   25.00               163.000000
BBBBB   18-20               1.00                    80.64               260,19

I know how to group data by concrete columns and then calculate 'mean' or 'count' using agg. However, in this case I don't know how to create ranges for AVG_PERCENT_EVAL_1. Also, I don't know how to define that lacking entries should not be considered as 0. For instance, in the above example there is no information about AAAAA with AVG_PERCENT_EVAL_1 equal to 18 and 20, therefore I just want to take the values for 19 without averaging with 0 values for 18 and 20.

Upvotes: 1

Views: 709

Answers (1)

Nickil Maveli
Nickil Maveli

Reputation: 29711

Steps:

1) Bin AVG_PERCENT_EVAL_1 into appropriate labels using pd.cut() by specifying a bin sequence.

Specifying include_lowest=True would take care of inclusiveness of the left endpoint "[" whereas right=False would make the right endpoint an open interval ")".

2) Using the returned categories, re-label them as per desired requirements.

3) Peform groupby making GROUP and the newly computed binned ranges as the grouped key, aggregate the means of all present columns after dropping AVG_PERCENT_EVAL_1 from them.


binning portion:

step=3
kwargs = dict(include_lowest=True, right=False)
bins = pd.cut(df.AVG_PERCENT_EVAL_1, bins=np.arange(18,40+step,step), **kwargs)
labels = [(str(int(cat[1:3])) + "-" + str(int(cat[5:7])-1)) for cat in bins.cat.categories]
bins.cat.categories = labels

assign and groupby.agg():

df = df.assign(AVG_PERCENT_RANGE=bins).drop("AVG_PERCENT_EVAL_1", axis=1)
df.groupby(['GROUP', 'AVG_PERCENT_RANGE'], as_index=False).agg('mean')

enter image description here

Upvotes: 1

Related Questions