Reputation: 8628
I have the following pandas dataframe (it is just a small extract):
GROUP AVG_PERCENT_EVAL_1 AVG_PERCENT_NEGATIVE AVG_TOTAL_WAIT_TIME AVG_TOTAL_SERVICE_TIME
AAAAA 19 11.000000 25.000000 163.000000
AAAAA 22 2.000000 146.364198 332.761317
AAAAA 23 0.500000 44.068225 302.708639
AAAAA 24 1.000000 122.672215 322.359795
AAAAA 26 1.000000 143.594896 317.940989
BBBBB 18 1.000000 121.225692 319.292226
BBBBB 19 1.000000 40.054707 201.096152
BBBBB 21 0.333333 29.221458 207.142059
BBBBB 27 2.000000 103.796290 313.685358
I need to make AVG_PERCENT_EVAL_1
more continuous, which means that, instead of exact values e.g. 18
, 19
, 20
, etc., I want to put ranges e.g. 18-20
,21-23
, and so on till around 40.
The data should be grouped by GROUP
and then AVG_PERCENT_NEGATIVE
, AVG_TOTAL_WAIT_TIME
and AVG_TOTAL_SERVICE_TIME
should be averaged for each corresponding range.
IMPORTANT: Let's take the range 18-20
. The group AAA
has corresponding entry with AVG_PERCENT_EVAL_1
equal to 19
, while the group BBBBB
has two entries falling in this range - 18
and 19
. The values of AVG_PERCENT_NEGATIVE
, AVG_TOTAL_WAIT_TIME
and AVG_TOTAL_SERVICE_TIME
should be these ones:
GROUP AVG_PERCENT_RANGE AVG_PERCENT_NEGATIVE AVG_TOTAL_WAIT_TIME AVG_TOTAL_SERVICE_TIME
AAAAA 18-20 11.00 25.00 163.000000
BBBBB 18-20 1.00 80.64 260,19
I know how to group data by concrete columns and then calculate 'mean' or 'count' using agg
. However, in this case I don't know how to create ranges for AVG_PERCENT_EVAL_1
. Also, I don't know how to define that lacking entries should not be considered as 0
. For instance, in the above example there is no information about AAAAA
with AVG_PERCENT_EVAL_1
equal to 18
and 20
, therefore I just want to take the values for 19
without averaging with 0
values for 18
and 20
.
Upvotes: 1
Views: 709
Reputation: 29711
Steps:
1) Bin AVG_PERCENT_EVAL_1 into appropriate labels using pd.cut()
by specifying a bin
sequence.
Specifying include_lowest=True
would take care of inclusiveness of the left endpoint "["
whereas right=False
would make the right endpoint an open interval ")"
.
2) Using the returned categories, re-label them as per desired requirements.
3) Peform groupby
making GROUP and the newly computed binned ranges as the grouped key, aggregate the means of all present columns after dropping AVG_PERCENT_EVAL_1 from them.
binning portion:
step=3
kwargs = dict(include_lowest=True, right=False)
bins = pd.cut(df.AVG_PERCENT_EVAL_1, bins=np.arange(18,40+step,step), **kwargs)
labels = [(str(int(cat[1:3])) + "-" + str(int(cat[5:7])-1)) for cat in bins.cat.categories]
bins.cat.categories = labels
assign and groupby.agg()
:
df = df.assign(AVG_PERCENT_RANGE=bins).drop("AVG_PERCENT_EVAL_1", axis=1)
df.groupby(['GROUP', 'AVG_PERCENT_RANGE'], as_index=False).agg('mean')
Upvotes: 1