Reputation: 5095
I have a wide dataset:
 id   x0   x1     x2    x3    x4    x5   ...  x10000  Type
  1   40   31.05  25.5  25.5  25.5  25   ...  33      1
  2   35   35.75  36.5  26.5  36.5  36.5  ...  29      0
  3   35   35.70  36.5  36.5  36.5  36.5  ...  29      1
  4   40   31.50  23.5  24.5  26.5  25   ...  33      1
...
900   40   31.05  25.5  25.5  25.5  25   ...  23      0
with each row being a time series. I would like to standardise all values in place, except for the last column, treating each row/time series as an independent distribution. I am thinking about appending two columns, mean and std (standard deviation), at the rightmost end of the dataframe and then standardising with apply, but that sounds cumbersome and I might make mistakes in the process. How can I do this, and is there an easier way? Thanks
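For what it's worth, this is roughly what I have in mind (an untested sketch, assuming the frame is called df and the value columns run from x0 to x10000):

# append per-row stats as helper columns
df['mean'] = df.loc[:, 'x0':'x10000'].mean(axis=1)
df['std'] = df.loc[:, 'x0':'x10000'].std(axis=1)
# standardise each value column against the per-row stats
df.loc[:, 'x0':'x10000'] = df.loc[:, 'x0':'x10000'].apply(
    lambda col: (col - df['mean']) / df['std'])
# drop the helper columns again
df = df.drop(columns=['mean', 'std'])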
Upvotes: 1
Views: 112
Reputation: 1241
We can use sklearn.preprocessing.scale! Set axis = 1 to scale the data along each row. This kind of data cleaning can be done nicely using sklearn.preprocessing. Here is the official doc.
Code:
# Generate data
import pandas as pd
import numpy as np
from sklearn.preprocessing import scale
data = pd.DataFrame({'A':np.random.randint(5,15,100),'B':np.random.randint(1,10,100),
'C':np.random.randint(0,10,100),'type':np.random.randint(0,2,100)})
data.head()
# filter out the 'type' column, then standardise the rest in place
data.loc[:,~data.columns.isin(['type'])] = scale(data.loc[:,~data.columns.isin(['type'])], axis = 1)
data.head()
Output:
    A  B  C  type
0  12  8  2     0
1   5  2  9     1
2  14  5  2     1
3   5  7  6     0
4   8  1  4     0

          A         B         C  type
0  1.135550  0.162221 -1.297771     0
1 -0.116248 -1.162476  1.278724     1
2  1.372813 -0.392232 -0.980581     1
3 -1.224745  1.224745  0.000000     0
4  1.278724 -1.162476 -0.116248     0
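Note that scale returns a plain NumPy array, so assigning it back through .loc is what preserves the dataframe's index and column labels. For the dataframe in the question, which also has an id column to leave untouched, the same pattern would look like this (a sketch, assuming the frame is called df with columns named id and Type as shown):

# exclude both the id and Type columns, then scale the rest row-wise
cols = ~df.columns.isin(['id', 'Type'])
df.loc[:, cols] = scale(df.loc[:, cols], axis=1)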
Alternatively, just use a lambda function with apply if your dataset is not huge.
Code:
# Generate data
import pandas as pd
import numpy as np
data = pd.DataFrame({'A':np.random.randint(5,15,100),'B':np.random.randint(1,10,100),
'C':np.random.randint(0,10,100),'type':np.random.randint(0,2,100)})
data.head()
# filter out the 'type' column, then standardise the rest in place
data.loc[:,~data.columns.isin(['type'])] = data.loc[:,~data.columns.isin(['type'])].\
apply(lambda x: (x - np.mean(x))/np.std(x), axis = 1)
data.head()
Output:
    A  B  C  type
0  12  8  2     0
1   5  2  9     1
2  14  5  2     1
3   5  7  6     0
4   8  1  4     0

          A         B         C  type
0  1.135550  0.162221 -1.297771     0
1 -0.116248 -1.162476  1.278724     1
2  1.372813 -0.392232 -0.980581     1
3 -1.224745  1.224745  0.000000     0
4  1.278724 -1.162476 -0.116248     0
Method 1 is faster than method 2.
Method 1: 2.03 ms ± 205 µs per loop (mean ± std. dev. of 100 runs, 100 loops each)
%%timeit -r 100 -n 100
data.loc[:,~data.columns.isin(['type'])] = scale(data.loc[:,~data.columns.isin(['type'])], axis = 1)
Method 2: 3.06 ms ± 153 µs per loop (mean ± std. dev. of 100 runs, 100 loops each)
%%timeit -r 100 -n 100
data.loc[:,~data.columns.isin(['type'])].apply(lambda x: (x - np.mean(x))/np.std(x), axis = 1)
Upvotes: 2
Reputation: 150735
You could compute the mean and std manually:
stats = df.iloc[:, 1:-1].agg(['mean', 'std'], axis=1)  # axis=1: aggregate across each row
df.iloc[:, 1:-1] = (df.iloc[:, 1:-1]
                    .sub(stats['mean'], axis='rows')   # subtract each row's mean
                    .div(stats['std'], axis='rows'))   # divide by each row's std
Output:
id x0 x1 x2 x3 x4 x5 x10000 Type
0 1 1.87515 0.297204 -0.681302 -0.681302 -0.681302 -0.769456 0.641003 1
1 2 0.31841 0.499129 0.679848 -1.72974 0.679848 0.679848 -1.12734 0
2 3 -0.0363456 0.218074 0.508839 0.508839 0.508839 0.508839 -2.21708 1
3 4 1.81012 0.392987 -0.940787 -0.774066 -0.440622 -0.690705 0.64307 1
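One subtlety: pandas' std uses the sample standard deviation (ddof=1) by default, while sklearn's scale and np.std use the population standard deviation (ddof=0), so these numbers will differ slightly from the sklearn-based answer. If matching scale matters, a minimal sketch:

mean = df.iloc[:, 1:-1].mean(axis=1)
std = df.iloc[:, 1:-1].std(axis=1, ddof=0)  # population std, matching scale / np.std
df.iloc[:, 1:-1] = df.iloc[:, 1:-1].sub(mean, axis='rows').div(std, axis='rows')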
Upvotes: 0