Reputation: 3978
I am trying to predict an output from a set of inputs using linear regression, as below:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
X = [[1, 1, 1, 1],
     [1, 1, 1, 1],
     [1, 2, 1, 1],
     [1, 3, 1, 1],
     [1, 4, 1, 1],
     [1, 2, 1, 1],
     [1, 3, 1, 1],
     [2, 4, 1, 1],
     [1, 1, 1, 1],
     [2, 1, 1, 1],
     [2, 4, 1, 1],
     [1, 5, 1, 1],
     [1, 1, 1, 1],
     [1, 1, 1, 1]]
y = [[1],
     [1],
     [1],
     [3],
     [2],
     [1],
     [3],
     [2],
     [1],
     [1],
     [2],
     [1],
     [1],
     [1]]
# Split X and y into X_train/X_test and y_train/y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)
print(regression_model.score(X_test, y_test)) # -1.1817143658810325
print(regression_model.predict([[1, 1, 1, 1]]))  # [[0.9694444444444441]]
I have passed the X values as input and am expecting y as output.
The score is negative, and the predicted output is [[0.9694444444444441]], which I expected to be 1.
How can I solve this issue?
Upvotes: 1
Views: 1655
Reputation: 7111
A linear regression attempts to minimize the Mean Squared Error with an optimal hyperplane. Most data is not perfectly linear (including yours), so the predictions will not be perfect, but they will have as low an error as possible given the linearity constraint. In your example, there isn't much of a difference between 0.97 and 1.00.
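You can measure that residual error directly; a minimal sketch, reusing the regression_model, X_train, and y_train from your question:

from sklearn.metrics import mean_squared_error

# The training MSE is small but not zero: even the best-fitting
# hyperplane still misses most points by a little.
train_mse = mean_squared_error(y_train, regression_model.predict(X_train))
print(train_mse)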
Consider a linear regression in a smaller number of dimensions, where visualization is easier (see the sketch below). All the regression does is choose the line that best fits the data; that doesn't mean the line passes through every point, so any prediction made from it will be off by a little bit.
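Something like the following will produce that kind of picture; this is just an illustrative sketch with made-up one-dimensional data, assuming matplotlib is installed:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# One feature, so the fit can be drawn as a line.
X_1d = np.array([[1], [2], [3], [4], [5], [6]])
y_1d = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8])

line = LinearRegression().fit(X_1d, y_1d)

# The line minimizes squared error but does not pass through every point.
plt.scatter(X_1d, y_1d, label="data")
plt.plot(X_1d, line.predict(X_1d), color="red", label="best-fit line")
plt.legend()
plt.show()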
The negative score (straight from the documentation) simply means that the model performs worse than if you did nothing but predict the mean of your data. Models can perform arbitrarily poorly on unseen data. Since an ordinary linear regression with an intercept can always learn that constant model on its training data, a negative test score indicates overfitting to the training set (likely due to the small sample size). If you score your training data instead, you should get a non-negative answer, and probably a positive one.
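You can check that yourself; the snippet below reuses the fitted regression_model and the train split from your question:

# R^2 on the data the model was trained on; for ordinary least
# squares with an intercept this is always >= 0.
print(regression_model.score(X_train, y_train))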
Examining your model a little more closely, you'll notice that anything with a true value of 1 is predicted relatively closely due to the large class imbalance (you have almost twice as many 1's as everything else put together). The 2's are a little worse, and the 3's have a horrendous prediction (compare the true values and predictions below). A linear model has a tough time making the huge jump from 1's and 2's to 3's for just a couple of points stuck in the middle of the rest of the point cloud.
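To see that pattern, print each true value next to its prediction; a quick check reusing X, y, and the fitted model from your question:

# Pair each true value with the model's prediction for that row;
# the 1's should come out close, the 2's rougher, the 3's far off.
for true, pred in zip(y, regression_model.predict(X)):
    print(true[0], round(float(pred[0]), 3))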
Upvotes: 3