Hadi

Reputation: 133

pandas apply custom function to every row of one column grouped by another column

I have a DataFrame with two columns, id and val:

df = pd.DataFrame({'id': [1,1,1,2,2,2,3,3,3,3], 'val': np.random.randn(10)})

   id       val
0   1  2.644347
1   1  0.378770
2   1 -2.107230
3   2 -0.043051
4   2  0.115948
5   2  0.054485
6   3  0.574845
7   3 -0.228612
8   3 -2.648036
9   3  0.569929

I want to apply a custom function to the val values within each id group, for example min-max scaling. This is how I would do it with a for loop:

df['scaled'] = 0.0
ids = df.id.drop_duplicates()
for i in range(len(ids)):
    # Copy the group's rows so the assignment below does not trigger SettingWithCopyWarning
    df1 = df[df.id == ids.iloc[i]].copy()
    df1['scaled'] = (df1.val - df1.val.min()) / (df1.val.max() - df1.val.min())
    df.loc[df.id == ids.iloc[i], 'scaled'] = df1['scaled']

And the result is (val differs from the sample above because np.random.randn generates new values on each run):

   id       val    scaled
0   1  0.457713  1.000000
1   1 -0.464513  0.000000
2   1  0.216352  0.738285
3   2  0.633652  0.990656
4   2 -1.099065  0.000000
5   2  0.649995  1.000000
6   3 -0.251099  0.306631
7   3 -1.003295  0.081387
8   3  2.064389  1.000000
9   3 -1.275086  0.000000

How can I do this faster without a loop?

Upvotes: 1

Views: 609

Answers (2)

Brad Solomon

Reputation: 40878

You can do this with groupby:

In [6]: def minmaxscale(s): return (s - s.min()) / (s.max() - s.min())

In [7]: df.groupby('id')['val'].apply(minmaxscale)
Out[7]: 
0    0.000000
1    1.000000
2    0.654490
3    1.000000
4    0.524256
5    0.000000
6    0.000000
7    0.100238
8    0.014697
9    1.000000
Name: val, dtype: float64

(Note that np.ptp(), short for peak-to-peak, can be used in place of s.max() - s.min().)

This applies minmaxscale() separately to each sub-Series of val, one per distinct id.

Taking the first group, for example:

In [11]: s = df[df.id == 1]['val']

In [12]: s
Out[12]: 
0    0.002722
1    0.656233
2    0.430438
Name: val, dtype: float64

In [13]: s.max() - s.min()
Out[13]: 0.6535106879021447

In [14]: (s - s.min()) / (s.max() - s.min())
Out[14]: 
0    0.00000
1    1.00000
2    0.65449
Name: val, dtype: float64
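
To store the scaled values as a new column, groupby's transform() is a possible alternative to apply(): it returns a Series aligned to the original index, whereas apply() may, depending on the pandas version, return a result indexed by the group key as well. The following is a minimal sketch rather than part of the session above; the fixed sample values are made up so that the result is reproducible.

```python
import pandas as pd

def minmaxscale(s):
    """Scale a Series to [0, 1] using its own min and max."""
    return (s - s.min()) / (s.max() - s.min())

# Hypothetical sample data (fixed values instead of np.random.randn)
df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                   'val': [0.0, 5.0, 10.0, 2.0, 4.0, 3.0]})

# transform() calls minmaxscale once per id group and returns a Series
# with the same index as df, so it can be assigned directly as a column
df['scaled'] = df.groupby('id')['val'].transform(minmaxscale)
print(df)
```

Note that a group whose values are all identical has max equal to min, so the division yields NaN for that group.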

Upvotes: 3

BENY

Reputation: 323226

A solution using sklearn's MinMaxScaler:

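import numpy as np
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# Scale each id group separately, then flatten to 1-D; assumes rows are already ordered by id
df['new'] = np.concatenate([scaler.fit_transform(x.values.reshape(-1, 1)) for _, x in df.groupby('id').val]).ravel()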
df
Out[271]: 
   id       val    scaled       new
0   1  0.457713  1.000000  1.000000
1   1 -0.464513  0.000000  0.000000
2   1  0.216352  0.738285  0.738284
3   2  0.633652  0.990656  0.990656
4   2 -1.099065  0.000000  0.000000
5   2  0.649995  1.000000  1.000000
6   3 -0.251099  0.306631  0.306631
7   3 -1.003295  0.081387  0.081387
8   3  2.064389  1.000000  1.000000
9   3 -1.275086  0.000000  0.000000
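
Concatenating the per-group arrays only lines up with the DataFrame when the rows are already ordered by id. As a hedged sketch of a variant that does not depend on row order and needs no sklearn, the per-group min and max can be broadcast back to each row with transform() and the scaling done in one vectorized expression. The interleaved sample data and the names lo and hi are made up for illustration.

```python
import pandas as pd

# Hypothetical data where the ids are interleaved rather than sorted
df = pd.DataFrame({'id': [1, 2, 1, 2, 1],
                   'val': [0.0, 2.0, 10.0, 4.0, 5.0]})

grouped = df.groupby('id')['val']
lo = grouped.transform('min')  # minimum of each row's own group
hi = grouped.transform('max')  # maximum of each row's own group

# Vectorized min-max scaling; lo and hi share df's index, so rows align
df['new'] = (df['val'] - lo) / (hi - lo)
print(df)
```

Because only built-in aggregations are used, no Python function is called per group, which tends to help when there are many groups.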

Upvotes: 2
