scale numerical values for different groups in python

Question

I want to scale numerical values within each group, similar to R's scale function.

Note: by "scale" I mean the metric (x - group_mean) / group_std.

Example dataset (to demonstrate the idea):

advertiser_id   value
10              11
10              22
10              2424
11              34
11              342342
.....

Desired results:

advertiser_id   scaled_value
10              -0.58
10              -0.57
10              1.15
11              -0.707
11              0.707
.....

Following the linked question "implementing R scale function in pandas in Python?", I took the scale function defined there and tried to apply it per group, like this:

dt.groupby("advertiser_id").apply(scale)

but I get an error:

ValueError: Shape of passed values is (2, 15770), indices imply (2, 23375)

In my original dataset the number of rows is 15770, and I don't think that the scale function maps a single value to more than 2 results (in this case).

I would appreciate it if you could give me some sample code or suggestions on how to modify it. Thanks!

CT Zhu · Accepted Answer

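First, np.std behaves differently from the standard deviation functions in most other languages in that its delta degrees of freedom (ddof) defaults to 0, giving the population standard deviation rather than the sample one. Therefore: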

In [9]:

print(df)

   advertiser_id   value
0             10      11
1             10      22
2             10    2424
3             11      34
4             11  342342

In [10]:

print(df.groupby('advertiser_id').transform(lambda x: (x - np.mean(x)) / np.std(x, ddof=1)))

      value
0 -0.581303
1 -0.573389
2  1.154691
3 -0.707107
4  0.707107

This matches R's result.
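For reference, the same computation can be written as a self-contained script. This is a minimal sketch using the sample data from the question; the helper name `scale` and the output column name `scaled_value` are chosen here for illustration and are not taken from the linked question.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "advertiser_id": [10, 10, 10, 11, 11],
    "value": [11, 22, 2424, 34, 342342],
})

def scale(x):
    # Center on the group mean and divide by the sample standard
    # deviation (ddof=1, i.e. the n - 1 denominator that R uses).
    return (x - x.mean()) / x.std(ddof=1)

# transform returns a result aligned with the original rows,
# so it can be assigned straight back as a new column.
df["scaled_value"] = df.groupby("advertiser_id")["value"].transform(scale)
print(df)
```

Note that pandas' own Series.std already defaults to ddof=1, unlike np.std on a plain array; passing ddof=1 explicitly just makes the intent visible.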

Second, if any of your groups (by advertiser_id) happens to contain just 1 item, the standard deviation is 0 with the default ddof=0 and undefined (nan) with ddof=1, so either way the scaled value is nan. Check whether you get nan for this reason. R would return nan in this case as well.
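One way to check for this is to count the rows in each group and look at which scaled values come out as nan. The sketch below assumes the same column names as above; the single-row group for advertiser 12 is a hypothetical addition to the sample data, included only to trigger the case.

```python
import pandas as pd

# Sample data plus one hypothetical single-row group (advertiser 12)
df = pd.DataFrame({
    "advertiser_id": [10, 10, 10, 11, 11, 12],
    "value": [11, 22, 2424, 34, 342342, 7],
})

grouped = df.groupby("advertiser_id")["value"]

# Number of non-null values in each row's group, aligned with df
group_counts = grouped.transform("count")

# Per-group scaling with the sample standard deviation (ddof=1)
scaled = grouped.transform(lambda x: (x - x.mean()) / x.std(ddof=1))

# Rows that belong to groups too small to have a sample std
print(df[group_counts < 2])

# How many scaled values ended up as nan
print(scaled.isna().sum())
```

How to treat such groups (drop them, leave them as nan, or set them to 0) depends on the analysis, so the sketch only detects them.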
