Reputation: 639
I have fit a ridge regression model on a data set (link to the dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data) as below:
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
y = train['SalePrice']
X = train.drop("SalePrice", axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30)
ridge = Ridge(alpha=0.1, normalize=True)
ridge.fit(X_train,y_train)
pred = ridge.predict(X_test)
I calculated the MSE and RMSE using the metrics module from sklearn as follows:
from sklearn.metrics import mean_squared_error
mean = mean_squared_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test,pred)
I am getting a very large value of MSE = 554084039.54321 and RMSE = 21821.8. I am trying to understand if my implementation is correct.
Upvotes: 2
Views: 8954
Reputation: 1
It's also possible to set the 'squared' parameter of mean_squared_error.
squared: bool, default=True. If True, returns the MSE value; if False, returns the RMSE value.
Upvotes: 0
Reputation: 24681
Your RMSE implementation is correct, which is easy to verify by taking the square root of sklearn's mean_squared_error.
I think you are missing a closing parenthesis though, here to be exact:
rmse = np.sqrt(mean_squared_error(y_test,pred)) # the last one was missing
Your MSE is high because the model is not capturing the relationships between your variables and the target very well. Bear in mind that each error is squared, so a prediction that is 1000 off in price contributes 1000000 to the squared error.
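To make the effect of squaring concrete, here is a minimal sketch with made-up error values (the numbers are hypothetical and not taken from the house price data). It shows how a single large miss can account for almost all of the MSE.

```python
# Hypothetical absolute errors in dollars for five predictions (illustrative only).
errors = [1000, 2000, 500, 1500, 20000]

# Squaring turns an error of 1000 into 1000000.
squared = [e ** 2 for e in errors]

# MSE is the mean of the squared errors.
mse = sum(squared) / len(squared)

# Fraction of the total squared error that comes from the single largest miss.
largest_share = max(squared) / sum(squared)

print(squared[0])               # 1000000
print(mse)                      # 81500000.0
print(round(largest_share, 3))  # 0.982
```

With these values, one prediction that is 20000 off is responsible for about 98% of the MSE, even though the other four predictions are fairly close.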
You may want to transform the price with the natural logarithm (numpy.log) so that it is on a log scale. This is common practice, especially for this problem (I assume you are doing House Prices: Advanced Regression Techniques); see the available kernels for guidance. With this approach, you will not get such big values.
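As a rough illustration of why the log scale gives smaller numbers, the sketch below compares RMSE on raw prices with RMSE on log prices. The prices and the rmse helper are made up for the example, and it uses math.log from the standard library so it runs without extra packages; numpy.log does the same thing elementwise on arrays.

```python
import math

# Hypothetical true and predicted sale prices (illustrative only).
y_true = [200000, 150000, 320000, 180000]
y_pred = [210000, 140000, 300000, 185000]

def rmse(a, b):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Move both series to the log scale.
log_true = [math.log(v) for v in y_true]
log_pred = [math.log(v) for v in y_pred]

rmse_raw = rmse(y_true, y_pred)      # measured in dollars
rmse_log = rmse(log_true, log_pred)  # measured in log units, roughly a relative error

print(rmse_raw)  # 12500.0
print(rmse_log)

# A prediction made on the log scale is mapped back to dollars with exp.
back_in_dollars = math.exp(log_pred[0])
```

If you train the model on the log of SalePrice, remember that its predictions are also on the log scale, so apply the exponential before reporting prices.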
Last but not least, check the Mean Absolute Error to see that your predictions are not as terrible as they seem.
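Here is a small sketch of that comparison, again with hypothetical prices rather than your actual predictions. MAE stays in the same units as the price and is never larger than RMSE, so it is often easier to interpret.

```python
import math

# Hypothetical true and predicted sale prices (illustrative only).
y_true = [200000, 150000, 320000, 180000]
y_pred = [210000, 140000, 300000, 185000]

# Signed error of each prediction.
errors = [p - t for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / len(errors)  # mean absolute error
mse = sum(e ** 2 for e in errors) / len(errors)  # mean squared error
rmse = math.sqrt(mse)                            # root mean squared error

print(mae)   # 11250.0
print(mse)   # 156250000.0
print(rmse)  # 12500.0
```

The MSE looks enormous, but the MAE says the predictions are off by about 11250 on average, which is small relative to prices in the hundreds of thousands. In sklearn the same quantity is available as sklearn.metrics.mean_absolute_error.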
Upvotes: 3