bugfoot

Reputation: 667

Does SKLearn LinearRegression prediction result depend on column order?

It seems that the predictions of scikit-learn's LinearRegression depend on the column order of X_train (and X_test), although in my understanding the OLS linear regression solution should be independent of it:

import pandas as pd

from sklearn.linear_model import LinearRegression

X_train = pd.DataFrame({
  'x2': [0.41881871483604843, 0.41881871483604843, 0.41881871483604843, -2.2128066838437888, 0.41881871483604843],
  'x1': [0.3226465587013849, 0.3226465587013849, 0.3226465587013849, -2.1432281979935226, 0.3226465587013849],
  'x3': [0.41881871483604843, 0.41881871483604843, 0.41881871483604843, -2.2128066838437888, 0.41881871483604843]
})

y_train = pd.Series([0.00208714705719199, 0.0, 0.0373802794439473, 0.4751917903756102, 0.01156975729482886])

X_test = pd.DataFrame({
  'x2': [0.6718361093920282, 0.39636690075505104, 0.4225844259460428, 0.4225844259460428, 0.6991034460436102],
  'x1': [1.417088758155678, 0.25726707774120766, 0.25726707774120766, 0.25726707774120766, 1.417088758155678],
  'x3': [0.6718361093920282, 0.39636690075505104, 0.4225844259460428, 0.4225844259460428, 0.6991034460436102]
})

y_test = pd.Series([0.21970766666406633, 0.1452871258871291, 0.08888275135771367, 0.08914350635018843, 0.04924794822392303])

model = LinearRegression().fit(X_train, y_train)

yhat_train = model.predict(X_train)
yhat_test = model.predict(X_test)

# Sort columns.

cols = sorted(X_train.columns)

sorted_X_train = X_train[cols].copy()
sorted_X_test = X_test[cols].copy()

sorted_model = LinearRegression()
sorted_model = sorted_model.fit(sorted_X_train, y_train)

sorted_yhat_train = sorted_model.predict(sorted_X_train)
sorted_yhat_test = sorted_model.predict(sorted_X_test)

print(f'yhat_test       : {yhat_test}')
print(f'sorted_yhat_test: {sorted_yhat_test}')

Results in:

yhat_test       : [-8.13124851e+12  4.20539351e+11  6.53526629e+11  6.53526629e+11
 -7.88893187e+12]
sorted_yhat_test: [-0.08075183  0.0192414   0.01603989  0.01603989 -0.08408154]

The coefficients are also different (in value too, not just in order). What am I doing wrong here?

Upvotes: 1

Views: 397

Answers (1)

Ben Reiniger

Reputation: 12602

Your feature space contains multicollinearity (x2 and x3 are identical columns), so there is no unique solution to the OLS problem, and it is perhaps no surprise that small changes like column order affect which solution is chosen.
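
A quick way to see the collinearity is to check the rank of the training matrix before and after centering. This is only a sketch using NumPy; the variable names and the tol=1e-10 cutoff are my own choices for illustration, not anything sklearn uses:

```python
import numpy as np

# Training features from the question, columns in the order x2, x1, x3.
x2 = [0.41881871483604843, 0.41881871483604843, 0.41881871483604843,
      -2.2128066838437888, 0.41881871483604843]
x1 = [0.3226465587013849, 0.3226465587013849, 0.3226465587013849,
      -2.1432281979935226, 0.3226465587013849]
X = np.column_stack([x2, x1, x2])  # x3 is an exact copy of x2

# Center each column, which is what fitting an intercept effectively does.
Xc = X - X.mean(axis=0)

# tol=1e-10 is an arbitrary illustrative cutoff for "numerically zero".
rank_raw = np.linalg.matrix_rank(X, tol=1e-10)
rank_centered = np.linalg.matrix_rank(Xc, tol=1e-10)

print("rank of X:", rank_raw)
print("rank of centered X:", rank_centered)
print("singular values of centered X:", np.linalg.svd(Xc, compute_uv=False))
```

With this cutoff I would expect rank 2 for the raw matrix and rank 1 after centering: the data contain only two distinct rows, so every centered column is a multiple of the same vector. The trailing singular values of the centered matrix are roundoff-sized rather than exactly zero, which is the kind of value a default rank threshold can decide either way.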

However, there are some weird things going on. LinearRegression uses scipy.linalg.lstsq under the hood to solve the OLS problem. Calling lstsq directly on your example, I get coefficients that are identical (reordered, of course) for the two inputs! sklearn does run _preprocess_data first, centering and scaling the data. Doing that manually, I can confirm that the two preprocessed inputs are, as expected, just reorderings of each other, but calling lstsq on them now gives different coefficients. What's more, the reported rank is different! Understanding that difference probably requires digging into the LAPACK drivers, which is beyond my expertise.
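
The comparison described above can be sketched roughly as follows. This is not sklearn's internal code: the center helper is a hypothetical stand-in for _preprocess_data (centering only, which I am assuming matches the default settings), and cond=1e-10 is an arbitrary cutoff picked for illustration.

```python
import numpy as np
from scipy.linalg import lstsq

x2 = [0.41881871483604843, 0.41881871483604843, 0.41881871483604843,
      -2.2128066838437888, 0.41881871483604843]
x1 = [0.3226465587013849, 0.3226465587013849, 0.3226465587013849,
      -2.1432281979935226, 0.3226465587013849]
y = np.array([0.00208714705719199, 0.0, 0.0373802794439473,
              0.4751917903756102, 0.01156975729482886])

X_a = np.column_stack([x2, x1, x2])  # original order: x2, x1, x3 (x3 == x2)
perm = [1, 0, 2]                     # reorders the columns to x1, x2, x3
X_b = X_a[:, perm]                   # sorted order


def center(X, y):
    """Hypothetical stand-in for sklearn's preprocessing: subtract the means."""
    return X - X.mean(axis=0), y - y.mean()


# 1) Raw data, default cutoff.
coef_a_raw, _, rank_a_raw, _ = lstsq(X_a, y)
coef_b_raw, _, rank_b_raw, _ = lstsq(X_b, y)
print("raw      ranks:", rank_a_raw, rank_b_raw)
print("raw      coefs:", coef_a_raw[perm], coef_b_raw)

# 2) Centered data, default cutoff (machine epsilon for the dtype).
Xc_a, yc = center(X_a, y)
Xc_b, _ = center(X_b, y)
coef_a, _, rank_a, _ = lstsq(Xc_a, yc)
coef_b, _, rank_b, _ = lstsq(Xc_b, yc)
print("centered ranks:", rank_a, rank_b)
print("centered coefs:", coef_a[perm], coef_b)

# 3) Centered data, explicit cutoff so roundoff-sized singular values are dropped.
coef_a_cut, _, rank_a_cut, _ = lstsq(Xc_a, yc, cond=1e-10)
coef_b_cut, _, rank_b_cut, _ = lstsq(Xc_b, yc, cond=1e-10)
print("cutoff   ranks:", rank_a_cut, rank_b_cut)
print("cutoff   coefs:", coef_a_cut[perm], coef_b_cut)
```

With the explicit cutoff, both column orders should report rank 1 and coefficients that are plain reorderings of each other. With the default cutoff, the roundoff-sized singular values may or may not be counted depending on the platform and column order, which would be consistent with the differing ranks mentioned above.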

Upvotes: 1
