Reputation: 3086
I need to compute a standard deviation along columns (axis=0), but since the two columns have different ranges (q1_5 is in [0, 15], q6_9 is in [0, 4]), I must normalize each column by its maximum value (q1_5 by 15 and q6_9 by 4).
     q1_5  q6_9  participant_id
0     2.0   0.0              11
1     3.0   0.0              11
2     3.0   0.0              11
3     3.0   0.0              11
4     3.0   0.0              11
183   2.0   0.0              14
184   3.0   0.0              14
185   2.0   0.0              14
186   3.0   0.0              14
187   3.0   0.0              14
358   5.0   0.0              17
359   5.0   0.0              17
360   3.0   0.0              17
361   4.0   0.0              17
362   4.0   0.0              17
535   4.0   0.0              18
536   5.0   0.0              18
537   4.0   0.0              18
538   3.0   0.0              18
539   3.0   0.0              18
I want to do it with GroupBy (as I am learning pandas and want to get used to its built-in functions).
I tried something like:
df.groupby('participant_id').agg([lambda x: (x.q1_5/15.0).std(), lambda x: (x.q6_9/4.0).std()])
but it didn't work.
AttributeError: 'Series' object has no attribute 'q1_5'
Upvotes: 0
Views: 130
When you pass a list of functions to groupby.agg without first selecting a column, it iterates over all columns in the DataFrame (except the grouping ones) and applies every function to each of them. So it starts with lambda x: (x.q1_5/15.0).std() and tries to apply it to column q1_5. At that point x is already the q1_5 Series of one group, so your code translates to df['q1_5']['q1_5'], and the attribute lookup fails with the AttributeError you saw. Instead, if you want to specify different functions for different columns, you need to use a dictionary:
df.groupby('participant_id').agg({'q1_5': lambda x: (x/15.).std(),
'q6_9': lambda x: (x/4.).std()})
Out:
q6_9 q1_5
participant_id
11 0.0 0.029814
14 0.0 0.036515
17 0.0 0.055777
18 0.0 0.055777
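As a cross-check, here is a minimal self-contained sketch that rebuilds the sample rows from the question and compares the dictionary form with a shortcut: because std(x / c) equals std(x) / c for a positive constant c, you can also take the plain group-wise std and divide afterwards. The name scale_max is just a helper I made up for the per-column maxima (15 and 4, as given in the question), and the original index labels are not reproduced.

```python
import pandas as pd

# Sample rows copied from the question (original index labels omitted).
df = pd.DataFrame({
    "q1_5": [2.0, 3.0, 3.0, 3.0, 3.0,
             2.0, 3.0, 2.0, 3.0, 3.0,
             5.0, 5.0, 3.0, 4.0, 4.0,
             4.0, 5.0, 4.0, 3.0, 3.0],
    "q6_9": [0.0] * 20,
    "participant_id": [11] * 5 + [14] * 5 + [17] * 5 + [18] * 5,
})

# Hypothetical helper: maximum of each column's scale, as stated in the question.
scale_max = pd.Series({"q1_5": 15.0, "q6_9": 4.0})

# Dictionary form: one function per column.
dict_std = df.groupby("participant_id").agg(
    {"q1_5": lambda x: (x / 15.0).std(),
     "q6_9": lambda x: (x / 4.0).std()}
)

# Shortcut: group-wise std first, then divide each column by its maximum.
# Dividing a DataFrame by a Series aligns the Series index with the columns.
scaled_std = df.groupby("participant_id")[["q1_5", "q6_9"]].std() / scale_max

print(dict_std)
print(scaled_std)
```

Both frames should hold the same numbers; the shortcut avoids Python-level lambdas, which may matter on larger data.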
For data with different ranges, there are many standardization options (min-max scaling, z-scores, the coefficient of variation, etc.); which one to choose depends on your dataset.
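To make those options concrete, here is a rough sketch of the three mentioned above, written with plain pandas arithmetic. The data and the column names a and b are invented for illustration (the question's q6_9 column is all zeros, which would give NaN for the z-score and the coefficient of variation), so treat this as a starting point rather than a recommendation.

```python
import pandas as pd

# Illustrative data only, not taken from the question.
df = pd.DataFrame({"a": [2.0, 3.0, 5.0, 4.0],
                   "b": [0.0, 1.0, 4.0, 3.0]})

# Min-max scaling: map each column onto [0, 1] using its observed min and max.
min_max = (df - df.min()) / (df.max() - df.min())

# Z-score: center each column at 0 and scale to unit sample standard deviation.
z_score = (df - df.mean()) / df.std()

# Coefficient of variation: spread relative to the mean, one value per column.
coef_var = df.std() / df.mean()

print(min_max)
print(z_score)
print(coef_var)
```

Note that the question's approach (dividing by 15 and 4) is min-max scaling with the theoretical bounds of each scale instead of the observed ones.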
Upvotes: 1