user2505650
user2505650

Reputation: 1381

Zscore Normalize column in Dataframe using groupby

I have a dataframe representing customers orders with many columns, two of those being 'user_id' and 'dollar'.

for example :

    user_id   dollar 
0  1  0.34592  5
1  1  0.02857  7
2  1  0.26672  6
3  1  0.34592  5
4  1  0.02857  9
5  1  0.26672  10
6  1  0.34592  6
      [...]
7  40  0.02857  20
8  40  0.26672  19
9  40  0.34592  8
10 40  0.02857  18
11 40  0.26672  26

I want to normalize the value of dollar with respect to the other values in each users row. I want the following result for the previous example:

 user_id   dollar norm_dollar
0  1  0.34592  5  -1.02774024
1  1  0.02857  7  0.07905694
2  1  0.26672  6  -0.47434165
3  1  0.34592  5  -1.02774024
4  1  0.02857  9  1.18585412
5  1  0.26672  10  1.73925271
6  1  0.34592  6  -0.47434165
      [...]
7  40  0.02857  20  0.7787612
8  40  0.26672  19  0.57109154
9  40  0.34592  8   -1.71327463
10 40  0.02857  18  0.36342189

EDIT:

I would like each results to be normalized wrt each user individually and not the values of the whole column, so for example with user2,[20,19,8,18] should be normalized as if the mean is the mean of user2 orders, here for example the mean is 16,25 and not the mean of the whole dataframe column.

I know how to do it with one user:

user1 = data.loc[data['user_id']==1]
data.loc[data['user_id']==1]['norm_dollar'] = sp.stats.mstats.zscore(user1['dollar'])

I tried to do it this way for all the users:

data.dollar.div(sp.stats.mstats.zscore(data.groupby('user_id').dollar))

But I got an error, do you have any idea on how to proceed?

Thank you

Upvotes: 0

Views: 1045

Answers (2)

Kacper Wolkowski
Kacper Wolkowski

Reputation: 1607

That should work for you:

def apply_zscores(x):
    x['norm_dollar'] = zscore(x['dollar'])
    return x

df = df.groupby('id').apply(lambda x: apply_zscores(x))

Upvotes: 1

Niels Joaquin
Niels Joaquin

Reputation: 1215

Different ways to do this—like joining the groupby dataframe back to the original—but I'm starting to like the use of transform for stuff like this.

The syntax is still verbose, but I think it's more readable than the join method.

df['norm_dollar'] = (df['dollar']
                        - df.groupby('user_id')['dollar'].transform(np.mean)) \
                        / df.groupby('user_id')['dollar'].transform(np.std)

If you need to specify degrees of freedom on np.std, you can turn that into

lambda x: np.std(x, ddof=n)

Upvotes: 1

Related Questions