Does scikit-learn's LinearRegression prediction depend on column order?

Question

It seems that the predictions of scikit-learn's LinearRegression depend on the column order of X_train (and X_test), although as I understand it the OLS solution should be independent of column order:

import pandas as pd

from sklearn.linear_model import LinearRegression

X_train = pd.DataFrame({
  'x2': [0.41881871483604843, 0.41881871483604843, 0.41881871483604843, -2.2128066838437888, 0.41881871483604843],
  'x1': [0.3226465587013849, 0.3226465587013849, 0.3226465587013849, -2.1432281979935226, 0.3226465587013849],
  'x3': [0.41881871483604843, 0.41881871483604843, 0.41881871483604843, -2.2128066838437888, 0.41881871483604843]
})

y_train = pd.Series([0.00208714705719199, 0.0, 0.0373802794439473, 0.4751917903756102, 0.01156975729482886])

X_test = pd.DataFrame({
  'x2': [0.6718361093920282, 0.39636690075505104, 0.4225844259460428, 0.4225844259460428, 0.6991034460436102],
  'x1': [1.417088758155678, 0.25726707774120766, 0.25726707774120766, 0.25726707774120766, 1.417088758155678],
  'x3': [0.6718361093920282, 0.39636690075505104, 0.4225844259460428, 0.4225844259460428, 0.6991034460436102]
})

y_test = pd.Series([0.21970766666406633, 0.1452871258871291, 0.08888275135771367, 0.08914350635018843, 0.04924794822392303])

model = LinearRegression().fit(X_train, y_train)

yhat_train = model.predict(X_train)
yhat_test = model.predict(X_test)

# Sort columns.

cols = sorted(X_train.columns)

sorted_X_train = X_train[cols].copy()
sorted_X_test = X_test[cols].copy()

sorted_model = LinearRegression()
sorted_model = sorted_model.fit(sorted_X_train, y_train)

sorted_yhat_train = sorted_model.predict(sorted_X_train)
sorted_yhat_test = sorted_model.predict(sorted_X_test)

print(f'yhat_test       : {yhat_test}')
print(f'sorted_yhat_test: {sorted_yhat_test}')

This prints:

yhat_test       : [-8.13124851e+12  4.20539351e+11  6.53526629e+11  6.53526629e+11
 -7.88893187e+12]
sorted_yhat_test: [-0.08075183  0.0192414   0.01603989  0.01603989 -0.08408154]

The coefficients also differ in value, not just in order. What am I doing wrong here?
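One thing worth checking with data like this is whether the training matrix has full column rank, since the column-order invariance of OLS only holds when the least-squares solution is unique. The sketch below is a diagnostic, not a fix. It reuses X_train from the question; the names duplicate_columns, rank_raw, rank_centered and n_distinct_rows are introduced here for illustration. It also assumes that fitting with an intercept amounts to working with mean-centered features.

```python
import numpy as np
import pandas as pd

# Training features copied from the question.
X_train = pd.DataFrame({
    'x2': [0.41881871483604843, 0.41881871483604843, 0.41881871483604843, -2.2128066838437888, 0.41881871483604843],
    'x1': [0.3226465587013849, 0.3226465587013849, 0.3226465587013849, -2.1432281979935226, 0.3226465587013849],
    'x3': [0.41881871483604843, 0.41881871483604843, 0.41881871483604843, -2.2128066838437888, 0.41881871483604843],
})

# Flag columns that are exact copies of an earlier column.
duplicate_columns = X_train.T.duplicated()
print(duplicate_columns)

# Count how many distinct rows the training set actually contains.
n_distinct_rows = len(X_train.drop_duplicates())
print(f'distinct rows: {n_distinct_rows}')

# Numerical rank of the raw features and of the mean-centered features
# (assumption: the intercept is handled by centering).
centered = X_train - X_train.mean()
rank_raw = np.linalg.matrix_rank(X_train.to_numpy())
rank_centered = np.linalg.matrix_rank(centered.to_numpy())
print(f'columns: {X_train.shape[1]}, raw rank: {rank_raw}, centered rank: {rank_centered}')
```

If either rank is smaller than the number of columns, infinitely many coefficient vectors fit the training data equally well. Training predictions are then still unique, but predictions for test rows that fall outside the span of the training rows can differ depending on which of those solutions the solver happens to return.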

