Reputation: 143
Scaling (for example with StandardScaler) maps columns with different values onto the same scale. But when building a model on the result, values that were different before are converted to the same values with mean = 0 and std = 1, so it should affect the model fit and results.
I took a toy pandas DataFrame whose first column runs from 1 to 10 and whose second column runs from 5 to 14, and scaled both using StandardScaler.
import numpy as np
import pandas as pd
ls1 = np.arange(1,10)
ls2 = np.arange(5,14)
before_scaling= pd.DataFrame()
before_scaling['a'] = ls1
before_scaling['b'] = ls2
'''
a b
0 1 5
1 2 6
2 3 7
3 4 8
4 5 9
5 6 10
6 7 11
7 8 12
8 9 13
'''
from sklearn.preprocessing import StandardScaler,MinMaxScaler
ss = StandardScaler()
after_scaling = pd.DataFrame(ss.fit_transform(before_scaling), columns=['a', 'b'])
'''
a b
0 -1.549193 -1.549193
1 -1.161895 -1.161895
2 -0.774597 -0.774597
3 -0.387298 -0.387298
4 0.000000 0.000000
5 0.387298 0.387298
6 0.774597 0.774597
7 1.161895 1.161895
8 1.549193 1.549193
'''
If a regression model (linear regression) is built on the two independent variables above, I believe fitting it on the before_scaling and after_scaling dataframes will produce different fits and results. If so, why do we use feature scaling? And if we apply feature scaling to the columns individually, one by one, will that also produce the same result?
Upvotes: 1
Views: 1417
Reputation: 143
After waiting for some time without getting an answer, I tried it myself and found the answer. After scaling, different columns can end up with the same values if their distributions have the same shape. The model still produces the same results with the changed feature values because it adjusts its coefficients (weights) and intercept to compensate.
# After scaling with Standard Scaler
b = -1.38777878e-17
t = 0.5 * X_a[0,0] + 0.5 * X_a[0,1] + b
t = np.array(t).reshape(-1,1)
sc2.inverse_transform(t)
# out 31.5
'''
X_a
array([[-1.64750894, -1.64750894],
[-1.47408695, -1.47408695],
[-1.30066495, -1.30066495],
[-1.12724296, -1.12724296],
[-0.95382097, -0.95382097],
[-0.78039897, -0.78039897],
[-0.60697698, -0.60697698],
[-0.43355498, -0.43355498],
[-0.26013299, -0.26013299],
[-0.086711 , -0.086711 ],
[ 0.086711 , 0.086711 ],
[ 0.26013299, 0.26013299],
[ 0.43355498, 0.43355498],
[ 0.60697698, 0.60697698],
[ 0.78039897, 0.78039897],
[ 0.95382097, 0.95382097],
[ 1.12724296, 1.12724296],
[ 1.30066495, 1.30066495],
[ 1.47408695, 1.47408695],
[ 1.64750894, 1.64750894]])
'''
# Before scaling
2.25 * X_b[0,0] + 2.25 * X_b[0,1] + 6.75
# out 31.5
'''
X_b
array([[ 1, 10],
[ 2, 11],
[ 3, 12],
[ 4, 13],
[ 5, 14],
[ 6, 15],
[ 7, 16],
[ 8, 17],
[ 9, 18],
[10, 19],
[11, 20],
[12, 21],
[13, 22],
[14, 23],
[15, 24],
[16, 25],
[17, 26],
[18, 27],
[19, 28],
[20, 29]], dtype=int64)
'''
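The same check can be run end to end. The following is a minimal sketch, assuming scikit-learn's LinearRegression and a made-up noisy target y (the data below is illustrative and not from the post above). It fits one model on the raw features and one on the standardized features, then compares coefficients and predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: two features on very different scales and a noisy linear target.
X = np.column_stack([rng.uniform(1, 20, 50), rng.uniform(100, 1000, 50)])
y = 2.0 * X[:, 0] + 0.05 * X[:, 1] + 3.0 + rng.normal(0, 0.1, 50)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

raw_model = LinearRegression().fit(X, y)
scaled_model = LinearRegression().fit(X_scaled, y)

# The coefficients and intercepts differ between the two fits...
print(raw_model.coef_, raw_model.intercept_)
print(scaled_model.coef_, scaled_model.intercept_)

# ...but the predictions agree up to floating-point error.
same_predictions = np.allclose(raw_model.predict(X), scaled_model.predict(X_scaled))
print(same_predictions)
```

For plain least squares the scaled coefficients are the raw ones multiplied by each feature's standard deviation (scaler.scale_), which is how the model absorbs the rescaling. Scaling matters more for models that are not invariant to it, such as regularized or distance-based ones.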
Upvotes: 0
Reputation: 3624
This happens because the fit_transform function works as follows:
For each feature you have ('a' and 'b' in your case), it applies this equation:
X = (X - MEAN) / STD
where MEAN is the mean of the feature and STD is its standard deviation.
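The first feature a has a mean of 5 and a sample standard deviation of 2.738613, while feature b has a mean of 9 and the same sample standard deviation of 2.738613. (StandardScaler divides by the population standard deviation, which is 2.581989 for both columns; that is why the first scaled value is -4 / 2.581989 = -1.549193.) So if you subtract from each value the mean of its feature, you get two identical features, and because the standard deviation is also equal for both, you end up with identical transformed columns.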
before_scaling['a'] = before_scaling['a'] - before_scaling['a'].mean()
before_scaling['b'] = before_scaling['b'] - before_scaling['b'].mean()
print(before_scaling)
a b
0 -4.0 -4.0
1 -3.0 -3.0
2 -2.0 -2.0
3 -1.0 -1.0
4 0.0 0.0
5 1.0 1.0
6 2.0 2.0
7 3.0 3.0
8 4.0 4.0
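To confirm the formula, here is a small sketch that standardizes the same toy columns by hand and compares the result with StandardScaler. It assumes the population standard deviation (ddof=0), which is what StandardScaler uses; the variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"a": np.arange(1, 10), "b": np.arange(5, 14)})

# Manual standardization, column by column, with the population std (ddof=0).
manual = (df - df.mean()) / df.std(ddof=0)

# StandardScaler applied to the whole frame, for comparison.
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

matches = np.allclose(manual.to_numpy(), scaled.to_numpy())
columns_identical = np.allclose(scaled["a"], scaled["b"])
print(matches, columns_identical)
```

Because the scaler works on each column independently, scaling the columns one by one gives the same result as scaling the whole frame at once.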
Finally, be aware that the stop value of the arange function is not included, so np.arange(1,10) produces 1 to 9, not 1 to 10.
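A quick way to see the endpoint behaviour, as a short sketch (the variable names are just for illustration):

```python
import numpy as np

values = np.arange(1, 10)      # the stop value 10 is excluded
inclusive = np.arange(1, 11)   # pass stop + 1 to include 10

print(values[-1], len(values))
print(inclusive[-1], len(inclusive))
```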
Upvotes: 1