Reputation: 718
I'm new to machine learning and I'm trying to use the linear model estimators that scikit-learn provides to predict the price of a used car. I tried different linear models, such as LinearRegression, Ridge, Lasso and ElasticNet, but in most cases all of them return a negative or near-zero score (-0.6 <= score <= 0.1).
Someone told me that this is caused by a multicollinearity problem, but I don't know how to solve it.
My sample code:
import numpy as np
import pandas as pd
from sklearn import linear_model
from sqlalchemy import create_engine
from sklearn.linear_model import Ridge
engine = create_engine('sqlite:///path-to-db')
query = "SELECT mileage, carcass, engine, transmission, state, drive, customs_cleared, price FROM cars WHERE mark='some mark' AND model='some model' AND year='some year'"
df = pd.read_sql_query(query, engine)
df = df.dropna()
df = df.reindex(np.random.permutation(df.index))
X_full = df[['mileage', 'carcass', 'engine', 'transmission', 'state', 'drive', 'customs_cleared']]
y_full = df['price']
n_train = -(len(X_full) // 5)  # integer division; hold out the last 20% of rows for testing
X_train = X_full[:n_train]
X_test = X_full[n_train:]
y_train = y_full[:n_train]
y_test = y_full[n_train:]
predict = [200000, 0, 2.5, 0, 0, 2, 0] # parameters of the car to predict
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
y_estimate = model.predict(X_test)
print("Residual sum of squares: %.2f" % np.mean((y_estimate - y_test) ** 2))
print("Variance score: %.2f" % model.score(X_test, y_test))
print("Predicted price: ", model.predict(predict))
Carcass, state, drive and customs_cleared are numeric codes that represent categorical types.
What is the correct way to implement this prediction? Should I do some data preprocessing, or use a different algorithm?
Thanks for any advice!
Upvotes: 6
Views: 9449
Reputation: 2487
Given that you are using Ridge regression, you should scale your variables using StandardScaler or MinMaxScaler, perhaps chained together in a Pipeline:
http://scikit-learn.org/stable/modules/pipeline.html#pipeline-chaining-estimators
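For instance, a minimal sketch (untested against your data, reusing X_train, y_train, X_test and predict from your snippet) that chains StandardScaler and Ridge so the scaling learned on the training split is re-applied automatically at prediction time:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Scale the features, then fit Ridge; the pipeline applies the same scaling at predict time
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_train, y_train)
print("Variance score: %.2f" % model.score(X_test, y_test))
print("Predicted price: ", model.predict(np.array(predict).reshape(1, -1)))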
If you were using vanilla linear regression, scaling wouldn't matter; but with Ridge regression, the regularization penalty term (controlled by alpha) treats differently scaled variables differently. See this discussion on stats:
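To illustrate (a toy sketch on synthetic data, not your dataset): the same linear relationship expressed on two different feature scales is penalized very differently by Ridge, so the coefficients end up shrunk by different amounts.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
x = rng.rand(200)                   # feature on a 0..1 scale
y = 3 * x + 0.1 * rng.randn(200)    # target depends linearly on x with true slope 3

# Same data, once as-is and once scaled up by 1e6 (think km vs mm)
small = Ridge(alpha=10.0).fit(x.reshape(-1, 1), y)
large = Ridge(alpha=10.0).fit((x * 1e6).reshape(-1, 1), y)

print(small.coef_)         # clearly shrunk below the true slope of 3 by the penalty
print(large.coef_ * 1e6)   # rescaled back: essentially unshrunk, close to 3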
Upvotes: 3