nilsinelabore

Reputation: 5095

How to standardise in place in pandas

I have a wide dataset:

id    x0  x1    x2   x3   x4   x5   ... x10000  Type 
1    40  31.05 25.5 25.5 25.5 25   ...  33      1    
2    35  35.75 36.5 26.5 36.5 36.5 ...  29      0
3    35  35.70 36.5 36.5 36.5 36.5 ...  29      1 
4    40  31.50 23.5 24.5 26.5 25   ...  33      1
 ... 
900  40  31.05 25.5 25.5 25.5 25   ...  23      0

with each row being a time series. I would like to standardise all values in place, except for the last column, treating each row/time series as an independent distribution. I am thinking of appending two columns, mean and std (standard deviation), to the right of the dataframe and then standardising with apply, but that sounds cumbersome and error-prone. How can I do this, and is there an easier way? Thanks
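Roughly what I have in mind (a sketch only; df is the frame above, the mean/std column names are placeholders of my own, and I assume both id and Type stay untouched):

import pandas as pd

# value columns only: everything except the identifier and the label
value_cols = [c for c in df.columns if c not in ('id', 'Type')]

# append per-row mean and std, then standardise each row via apply
df['mean'] = df[value_cols].mean(axis=1)
df['std'] = df[value_cols].std(axis=1)
df[value_cols] = df[value_cols].apply(
    lambda row: (row - df.at[row.name, 'mean']) / df.at[row.name, 'std'],
    axis=1)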

Upvotes: 1

Views: 112

Answers (2)

Travis

Reputation: 1241

Method 1:

We can use sklearn.preprocessing.scale; set axis=1 to scale the data on each row.

This kind of data cleaning can be done nicely with sklearn.preprocessing. Here is the official doc.

Code:

# Generate data
import pandas as pd
import numpy as np
from sklearn.preprocessing import scale
data = pd.DataFrame({'A':np.random.randint(5,15,100),'B':np.random.randint(1,10,100),
                     'C':np.random.randint(0,10,100),'type':np.random.randint(0,2,100)})
data.head()

# filter out the 'type' column, then standardize the remaining columns in place
data.loc[:, ~data.columns.isin(['type'])] = scale(data.loc[:, ~data.columns.isin(['type'])], axis=1)
data.head()

Output:

    A   B   C   type
0   12  8   2   0
1   5   2   9   1
2   14  5   2   1
3   5   7   6   0
4   8   1   4   0

    A           B           C           type
0   1.135550    0.162221    -1.297771   0
1   -0.116248   -1.162476   1.278724    1
2   1.372813    -0.392232   -0.980581   1
3   -1.224745   1.224745    0.000000    0
4   1.278724    -1.162476   -0.116248   0
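Applied to the wide frame in the question, Method 1 would look roughly like this (a sketch only, assuming the frame is named df and that both id and Type should be left untouched):

from sklearn.preprocessing import scale

# scale every column except id and Type, one row at a time
value_cols = ~df.columns.isin(['id', 'Type'])
df.loc[:, value_cols] = scale(df.loc[:, value_cols], axis=1)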

Method 2:

Just use a lambda function with apply if your dataset is not huge.

Code:

# Generate data
import pandas as pd
import numpy as np
from sklearn.preprocessing import scale
data = pd.DataFrame({'A':np.random.randint(5,15,100),'B':np.random.randint(1,10,100),
                     'C':np.random.randint(0,10,100),'type':np.random.randint(0,2,100)})
data.head()

# filter out the 'type' column, then standardize the remaining columns in place
data.loc[:, ~data.columns.isin(['type'])] = data.loc[:, ~data.columns.isin(['type'])] \
    .apply(lambda x: (x - np.mean(x)) / np.std(x), axis=1)
data.head()

Output:

    A   B   C   type
0   12  8   2   0
1   5   2   9   1
2   14  5   2   1
3   5   7   6   0
4   8   1   4   0

    A           B           C           type
0   1.135550    0.162221    -1.297771   0
1   -0.116248   -1.162476   1.278724    1
2   1.372813    -0.392232   -0.980581   1
3   -1.224745   1.224745    0.000000    0
4   1.278724    -1.162476   -0.116248   0

Speed comparison:

Method 1 is faster than Method 2.

Method 1: 2.03 ms ± 205 µs per loop (mean ± std. dev. of 100 runs, 100 loops each)

%%timeit -r 100 -n 100
data.loc[:,~data.columns.isin(['type'])] = scale(data.loc[:,~data.columns.isin(['type'])], axis = 1)

Method 2: 3.06 ms ± 153 µs per loop (mean ± std. dev. of 100 runs, 100 loops each)

%%timeit -r 100 -n 100
data.loc[:, ~data.columns.isin(['type'])] = data.loc[:, ~data.columns.isin(['type'])] \
    .apply(lambda x: (x - np.mean(x)) / np.std(x), axis=1)

Upvotes: 2

Quang Hoang

Reputation: 150735

You could compute mean and std manually:

stats = df.iloc[:,1:-1].agg(['mean','std'], axis=1)     # axis=1 apply on rows
df.iloc[:, 1:-1] = (df.iloc[:, 1:-1]
                      .sub(stats['mean'], axis='rows')  # axis='rows' apply on rows
                      .div(stats['std'],axis='rows')
                   )

Output:

      id          x0        x1         x2         x3         x4         x5     x10000    Type
 0     1   1.87515    0.297204  -0.681302  -0.681302  -0.681302  -0.769456   0.641003       1
 1     2   0.31841    0.499129   0.679848  -1.72974    0.679848   0.679848  -1.12734        0
 2     3  -0.0363456  0.218074   0.508839   0.508839   0.508839   0.508839  -2.21708        1
 3     4   1.81012    0.392987  -0.940787  -0.774066  -0.440622  -0.690705   0.64307        1
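Note that these numbers can differ slightly from the sklearn-based answer: pandas' std (and agg(['mean','std'])) uses the sample standard deviation (ddof=1), while sklearn.preprocessing.scale and np.std default to the population standard deviation (ddof=0). If you need the two to match, a sketch:

# pass ddof=0 so the result matches sklearn.preprocessing.scale / np.std
mean = df.iloc[:, 1:-1].mean(axis=1)
std = df.iloc[:, 1:-1].std(axis=1, ddof=0)
df.iloc[:, 1:-1] = df.iloc[:, 1:-1].sub(mean, axis='rows').div(std, axis='rows')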

Upvotes: 0
