Reputation: 667
It seems that the predictions of scikit-learn's LinearRegression depend on the column order of X_train (and X_test), although in my understanding the OLS linear regression solution should be independent of it:
import pandas as pd
from sklearn.linear_model import LinearRegression
X_train = pd.DataFrame({
    'x2': [0.41881871483604843, 0.41881871483604843, 0.41881871483604843, -2.2128066838437888, 0.41881871483604843],
    'x1': [0.3226465587013849, 0.3226465587013849, 0.3226465587013849, -2.1432281979935226, 0.3226465587013849],
    'x3': [0.41881871483604843, 0.41881871483604843, 0.41881871483604843, -2.2128066838437888, 0.41881871483604843]
})
y_train = pd.Series([0.00208714705719199, 0.0, 0.0373802794439473, 0.4751917903756102, 0.01156975729482886])
X_test = pd.DataFrame({
    'x2': [0.6718361093920282, 0.39636690075505104, 0.4225844259460428, 0.4225844259460428, 0.6991034460436102],
    'x1': [1.417088758155678, 0.25726707774120766, 0.25726707774120766, 0.25726707774120766, 1.417088758155678],
    'x3': [0.6718361093920282, 0.39636690075505104, 0.4225844259460428, 0.4225844259460428, 0.6991034460436102]
})
y_test = pd.Series([0.21970766666406633, 0.1452871258871291, 0.08888275135771367, 0.08914350635018843, 0.04924794822392303])
model = LinearRegression().fit(X_train, y_train)
yhat_train = model.predict(X_train)
yhat_test = model.predict(X_test)
# Sort columns.
cols = sorted(X_train.columns)
sorted_X_train = X_train[cols].copy()
sorted_X_test = X_test[cols].copy()
sorted_model = LinearRegression()
sorted_model = sorted_model.fit(sorted_X_train, y_train)
sorted_yhat_train = sorted_model.predict(sorted_X_train)
sorted_yhat_test = sorted_model.predict(sorted_X_test)
print(f'yhat_test : {yhat_test}')
print(f'sorted_yhat_test: {sorted_yhat_test}')
Results in:
yhat_test : [-8.13124851e+12 4.20539351e+11 6.53526629e+11 6.53526629e+11
-7.88893187e+12]
sorted_yhat_test: [-0.08075183 0.0192414 0.01603989 0.01603989 -0.08408154]
The coefficients are also different (in value too, not just in order). What am I doing wrong here?
Upvotes: 1
Views: 397
Reputation: 12602
Your feature space contains multicollinearity (x2 and x3 are identical columns), so there is no unique solution to the OLS problem, and it is perhaps no surprise that small changes like column order affect which solution is chosen.
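The redundancy can be checked directly on the training data from the question. The sketch below is only an illustration: the names `duplicate_cols`, `n_distinct_rows`, and `singular_values` are introduced here, and the exact singular values printed will vary slightly with platform and library version.

```python
import numpy as np
import pandas as pd

# Training features copied from the question.
X_train = pd.DataFrame({
    'x2': [0.41881871483604843, 0.41881871483604843, 0.41881871483604843, -2.2128066838437888, 0.41881871483604843],
    'x1': [0.3226465587013849, 0.3226465587013849, 0.3226465587013849, -2.1432281979935226, 0.3226465587013849],
    'x3': [0.41881871483604843, 0.41881871483604843, 0.41881871483604843, -2.2128066838437888, 0.41881871483604843],
})

# x2 and x3 hold exactly the same values, so their coefficients can be
# traded off against each other without changing the fit.
duplicate_cols = X_train['x2'].equals(X_train['x3'])

# Count the distinct training points (rows).
n_distinct_rows = len(X_train.drop_duplicates())

# Singular values of the centered design matrix; values near machine
# precision indicate directions the data does not constrain.
centered = X_train - X_train.mean()
singular_values = np.linalg.svd(centered.to_numpy(), compute_uv=False)

print(f'x2 identical to x3: {duplicate_cols}')
print(f'distinct rows     : {n_distinct_rows}')
print(f'singular values   : {singular_values}')
```

Since x2 and x3 are identical and the five training rows contain only two distinct points, a model with an intercept can pin down just one direction in coefficient space. Every other direction is free, so different solvers (or the same solver on permuted columns) can legitimately return different coefficients.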
However, there are some weird things going on. LinearRegression uses scipy.linalg.lstsq under the hood to solve the OLS problem. Calling lstsq directly on your example, I get coefficients that are identical (reordered, of course) for the two inputs! sklearn does run _preprocess_data first, centering and scaling the data. Doing that manually, I can confirm that the preprocessed outputs are, as expected, just reorderings of each other, but calling lstsq on these two now gives different coefficients! What's more, the reported rank is different! Understanding that difference probably means digging into the LAPACK drivers, which is beyond my expertise.
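The comparison described above can be sketched roughly as follows. This is an approximation, not the exact code path inside sklearn: the helper `lstsq_by_column` is hypothetical, and plain mean-centering is assumed here as a stand-in for what `_preprocess_data` does. No expected output is shown because the coefficients and the reported rank depend on the platform and the LAPACK build.

```python
import pandas as pd
from scipy import linalg

# Training data copied from the question.
X_train = pd.DataFrame({
    'x2': [0.41881871483604843, 0.41881871483604843, 0.41881871483604843, -2.2128066838437888, 0.41881871483604843],
    'x1': [0.3226465587013849, 0.3226465587013849, 0.3226465587013849, -2.1432281979935226, 0.3226465587013849],
    'x3': [0.41881871483604843, 0.41881871483604843, 0.41881871483604843, -2.2128066838437888, 0.41881871483604843],
})
y_train = pd.Series([0.00208714705719199, 0.0, 0.0373802794439473, 0.4751917903756102, 0.01156975729482886])


def lstsq_by_column(X, y, center):
    """Hypothetical helper: solve OLS with scipy.linalg.lstsq.

    Returns ({column name: coefficient}, effective rank reported by lstsq).
    """
    A = X.to_numpy(dtype=float)
    b = y.to_numpy(dtype=float)
    if center:
        # Assumed stand-in for sklearn's preprocessing: subtract column means.
        A = A - A.mean(axis=0)
        b = b - b.mean()
    coef, _, rank, _ = linalg.lstsq(A, b)
    return dict(zip(X.columns, coef)), rank


cols = sorted(X_train.columns)
for center in (False, True):
    for label, X in (('original', X_train), ('sorted', X_train[cols])):
        coefs, rank = lstsq_by_column(X, y_train, center)
        # Keying coefficients by column name makes the two orderings comparable.
        ordered = {name: coefs[name] for name in cols}
        print(f'center={center!s:5} {label:8} rank={rank} coefs={ordered}')
```

Comparing the `rank` values between the centered and uncentered runs is the quickest way to see whether the solver treated a near-zero singular value as zero or as genuine signal on a given machine.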
Upvotes: 1