Shyngys Kassymov

Reputation: 718

When do scikit-learn linear models return a negative value for score?

I'm new to machine learning and am trying to use the linear model estimators that scikit-learn provides to predict the price of a used car. I have tried different linear models, such as LinearRegression, Ridge, Lasso and ElasticNet, but in most cases all of them return a negative score (-0.6 <= score <= 0.1).

Someone told me that this is because of a multicollinearity problem, but I don't know how to solve it.

My sample code:

import numpy as np
import pandas as pd
from sklearn import linear_model
from sqlalchemy import create_engine
from sklearn.linear_model import Ridge

engine = create_engine('sqlite:///path-to-db')

query = "SELECT mileage, carcass, engine, transmission, state, drive, customs_cleared, price FROM cars WHERE mark='some mark' AND model='some model' AND year='some year'"
df = pd.read_sql_query(query, engine)
df = df.dropna()
df = df.reindex(np.random.permutation(df.index))

X_full = df[['mileage', 'carcass', 'engine', 'transmission', 'state', 'drive', 'customs_cleared']]
y_full = df['price']

n_train = -(len(X_full) // 5)  # hold out the last 20% of rows as a test set
X_train = X_full[:n_train]
X_test = X_full[n_train:]
y_train = y_full[:n_train]
y_test = y_full[n_train:]

predict = [200000, 0, 2.5, 0, 0, 2, 0] # parameters of the car to predict

model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
y_estimate = model.predict(X_test)

print("Residual sum of squares: %.2f" % np.mean((y_estimate - y_test) ** 2))
print("Variance score: %.2f" % model.score(X_test, y_test))
print("Predicted price: ", model.predict(predict))

The carcass, state, drive and customs_cleared columns are numeric codes that represent categories.

What is the correct way to implement the prediction? Maybe some data preprocessing or a different algorithm?

Thanks for any advice!

Upvotes: 6

Views: 9449

Answers (1)

Andreus

Reputation: 2487

Given that you are using Ridge Regression, you should scale your variables using StandardScaler or MinMaxScaler:

http://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling

Perhaps using a Pipeline:

http://scikit-learn.org/stable/modules/pipeline.html#pipeline-chaining-estimators

If you were using vanilla linear regression, scaling wouldn't matter; but with Ridge Regression, the regularization penalty (controlled by alpha) treats differently scaled variables differently. See this discussion on Cross Validated:

https://stats.stackexchange.com/questions/29781/when-should-you-center-your-data-when-should-you-standardize
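A minimal sketch of what this looks like, assuming the X_train, X_test, y_train, y_test and predict variables from the question; the alpha value is carried over from the question, not tuned:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Chain scaling and regression so the scaler is fit only on the training data
# and the same transformation is applied automatically at predict time.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_train, y_train)

print("Variance score: %.2f" % model.score(X_test, y_test))
print("Predicted price: ", model.predict([predict]))

With the scaler inside the pipeline, the alpha penalty acts on coefficients of comparably scaled features, which is the point of the linked discussion.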

Upvotes: 3
