Reputation: 1381
I have a dataframe representing customers orders with many columns, two of those being 'user_id' and 'dollar'.
for example :
user_id dollar
0 1 0.34592 5
1 1 0.02857 7
2 1 0.26672 6
3 1 0.34592 5
4 1 0.02857 9
5 1 0.26672 10
6 1 0.34592 6
[...]
7 40 0.02857 20
8 40 0.26672 19
9 40 0.34592 8
10 40 0.02857 18
11 40 0.26672 26
I want to normalize the value of dollar with respect to the other values in each users row. I want the following result for the previous example:
user_id dollar norm_dollar
0 1 0.34592 5 -1.02774024
1 1 0.02857 7 0.07905694
2 1 0.26672 6 -0.47434165
3 1 0.34592 5 -1.02774024
4 1 0.02857 9 1.18585412
5 1 0.26672 10 1.73925271
6 1 0.34592 6 -0.47434165
[...]
7 40 0.02857 20 0.7787612
8 40 0.26672 19 0.57109154
9 40 0.34592 8 -1.71327463
10 40 0.02857 18 0.36342189
EDIT:
I would like each results to be normalized wrt each user individually and not the values of the whole column, so for example with user2,[20,19,8,18] should be normalized as if the mean is the mean of user2 orders, here for example the mean is 16,25 and not the mean of the whole dataframe column.
I know how to do it with one user:
user1 = data.loc[data['user_id']==1]
data.loc[data['user_id']==1]['norm_dollar'] = sp.stats.mstats.zscore(user1['dollar'])
I tried to do it this way for all the users:
data.dollar.div(sp.stats.mstats.zscore(data.groupby('user_id').dollar))
But I got an error, do you have any idea on how to proceed?
Thank you
Upvotes: 0
Views: 1045
Reputation: 1607
That should work for you:
def apply_zscores(x):
x['norm_dollar'] = zscore(x['dollar'])
return x
df = df.groupby('id').apply(lambda x: apply_zscores(x))
Upvotes: 1
Reputation: 1215
Different ways to do this—like joining the groupby
dataframe back to the original—but I'm starting to like the use of transform
for stuff like this.
The syntax is still verbose, but I think it's more readable than the join method.
df['norm_dollar'] = (df['dollar']
- df.groupby('user_id')['dollar'].transform(np.mean)) \
/ df.groupby('user_id')['dollar'].transform(np.std)
If you need to specify degrees of freedom on np.std
, you can turn that into
lambda x: np.std(x, ddof=n)
Upvotes: 1