Reputation:
Say I have a data set like this,
X | Y |
---|---|
2000 | 900000 |
1000 | 78991891 |
9000 | 7868141891 |
8000 | 78931891 |
I can standardize the X variable using the formula (X - mean(X)) / std_deviation(X), which feature-scales the X values.
Now say I have a data set like this,
X1 | X2 | Y |
---|---|---|
19 | 19000 | 0 |
35 | 20000 | 1 |
26 | 50000 | 1 |
27 | 90000 | 0 |
Here my independent variables are X = {X1, X2}, and I want to feature scale them. How is feature scaling performed on two variables together? I am not asking for a code snippet; I am looking for the mathematics behind it.
I have tried two things on my own.
First, I computed the standardization over all of X1 and X2 pooled together, treating them as a single variable X, so the mean was the sum of all values in X1 and X2 divided by the total number of values. This does not match the answer produced by the Python library:
from sklearn.preprocessing import StandardScaler as SC
sc_X=SC()
X=sc_X.fit_transform(X)
Second, I standardized X1 and X2 separately, but that does not match the Python library's output either.
So my question is: how is standardization computed when we have two or more independent variables?
Upvotes: 0
Views: 896
Reputation: 2316
According to sklearn documentation: "Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set".
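In mathematical terms, the quoted sentence can be sketched as follows. The symbols are my own notation, not taken from the documentation: x_ij is the value of feature j in sample i, n is the number of samples, and mu_j and sigma_j are the mean and standard deviation of column j alone.

```latex
\mu_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij},
\qquad
\sigma_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_{ij}-\mu_j\right)^2},
\qquad
z_{ij} = \frac{x_{ij}-\mu_j}{\sigma_j}
```

Note the 1/n inside the square root: StandardScaler uses the population (biased) standard deviation, equivalent to numpy.std with ddof=0. If you standardized X1 and X2 separately but divided by n - 1 instead, that could explain why your numbers did not match.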
Here is sample code to check this:
import pandas as pd
from sklearn.preprocessing import StandardScaler
df = pd.DataFrame(data=[[19, 19000], [35, 20000], [26, 50000], [27, 90000]],
                  columns=['x1', 'x2'])
sc = StandardScaler()
sc.fit(df)
print(sc.mean_)
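Since the question asks for the mathematics rather than the library call, here is a minimal sketch of the same per-column computation using only the Python standard library. The function name standardize_columns is hypothetical (it is not part of sklearn), and the sketch assumes no column is constant, since a zero standard deviation would cause a division by zero.

```python
from statistics import fmean, pstdev


def standardize_columns(rows):
    """Standardize each column of a row-major table independently.

    Hypothetical helper for illustration; assumes no column is constant.
    """
    columns = list(zip(*rows))                 # transpose: one tuple per feature
    means = [fmean(col) for col in columns]    # per-feature mean
    stds = [pstdev(col) for col in columns]    # population std (divides by n)
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


X = [[19, 19000], [35, 20000], [26, 50000], [27, 90000]]
Z = standardize_columns(X)
for row in Z:
    print(row)
```

Each column of Z then has mean 0 and population standard deviation 1. Swapping pstdev for statistics.stdev (which divides by n - 1) gives different values, which is a common reason a hand computation disagrees with StandardScaler.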
Upvotes: 1