Arnold Klein

Reputation: 3086

Application of different functions to Pandas columns via Groupby

I need to compute the standard deviation along columns (axis=0), but since the two columns have different ranges (q1_5 in [0, 15], q6_9 in [0, 4]), I must first normalize each column by its maximum value (q1_5 by 15 and q6_9 by 4).

      q1_5  q6_9  participant_id
0      2.0   0.0              11
1      3.0   0.0              11
2      3.0   0.0              11
3      3.0   0.0              11
4      3.0   0.0              11
183    2.0   0.0              14
184    3.0   0.0              14
185    2.0   0.0              14
186    3.0   0.0              14
187    3.0   0.0              14
358    5.0   0.0              17
359    5.0   0.0              17
360    3.0   0.0              17
361    4.0   0.0              17
362    4.0   0.0              17
535    4.0   0.0              18
536    5.0   0.0              18
537    4.0   0.0              18
538    3.0   0.0              18
539    3.0   0.0              18

I want to do it with GroupBy (as I am learning pandas and want to get used to its built-in functions).

I tried to do something like:

df.groupby('participant_id').agg([lambda x: (x.q1_5/15.0).std(), lambda x: (x.q6_9/4.0).std()])

but it didn't work.

AttributeError: 'Series' object has no attribute 'q1_5'

QUESTIONS

  1. To compare the std() of two arrays with different ranges, should I normalize first?
  2. What's wrong with my solution?

Upvotes: 0

Views: 130

Answers (1)

user2285236

Reputation:

When you pass a list of functions to groupby.agg without first selecting a column, it applies every function in the list to every column of the DataFrame (except the grouping column). Each function receives a single column of one group as a Series, not the whole DataFrame. So the first lambda, lambda x: (x.q1_5/15.0).std(), is called with the q1_5 column, and x.q1_5 then amounts to df['q1_5']['q1_5'], which raises the AttributeError. If you want to apply different functions to different columns, you need to pass a dictionary instead:

df.groupby('participant_id').agg({'q1_5': lambda x: (x/15.).std(), 
                                  'q6_9': lambda x: (x/4.).std()})
Out: 
                q6_9      q1_5
participant_id                
11               0.0  0.029814
14               0.0  0.036515
17               0.0  0.055777
18               0.0  0.055777
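
As a minimal sketch of two equivalent ways to get the same numbers: the first uses named aggregation (this assumes pandas 0.25 or later), and the second relies on the fact that std(x / c) equals std(x) / c for a positive constant c, so the division can happen after the groupby. The frame below contains only participants 11 and 17 from the question, and the output names q1_5_std and q6_9_std are made up for illustration.

```python
import pandas as pd

# Two participants taken from the question's sample data.
df = pd.DataFrame({
    "q1_5": [2.0, 3.0, 3.0, 3.0, 3.0, 5.0, 5.0, 3.0, 4.0, 4.0],
    "q6_9": [0.0] * 10,
    "participant_id": [11] * 5 + [17] * 5,
})

# Option A: named aggregation, one (column, function) pair per output column.
# Each lambda receives a single column of one group as a Series.
by_lambda = df.groupby("participant_id").agg(
    q1_5_std=("q1_5", lambda s: (s / 15.0).std()),
    q6_9_std=("q6_9", lambda s: (s / 4.0).std()),
)

# Option B: take the per-group std first, then divide each column by its
# maximum possible value. The Series index is aligned with the column names.
scale = pd.Series({"q1_5": 15.0, "q6_9": 4.0})
by_division = df.groupby("participant_id")[["q1_5", "q6_9"]].std() / scale

print(by_lambda)
print(by_division)
```

Both options use the sample standard deviation (ddof=1), which is the pandas default, so they agree with the output shown above.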

For data with different ranges, there are many standardization options (min-max scaling, z-scores, the coefficient of variation, etc.), and which one to choose depends on your dataset.
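
To make those options concrete, here is a rough sketch of the three mentioned above. The columns a and b and their values are hypothetical, not the question's data, and this is only one common way to write each transformation.

```python
import pandas as pd

# Hypothetical scores on two different scales (not the question's data).
df = pd.DataFrame({
    "a": [2.0, 3.0, 5.0, 4.0],  # assumed nominal range 0-15
    "b": [1.0, 0.0, 4.0, 3.0],  # assumed nominal range 0-4
})

# Min-max scaling: map each column to [0, 1] using its observed min and max.
min_max = (df - df.min()) / (df.max() - df.min())

# Z-score: subtract the column mean and divide by the sample std (ddof=1).
z_score = (df - df.mean()) / df.std()

# Coefficient of variation: std relative to the mean, one value per column.
coef_var = df.std() / df.mean()

print(min_max)
print(z_score)
print(coef_var)
```

Note that the coefficient of variation is undefined when a column's mean is zero, which is the case for q6_9 in the sample shown in the question.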

Upvotes: 1
