Reputation: 218
I have a training dataset in which every Y value is 0.75, and my regression model predicts a score for each row. When I calculate r2 on these predictions it comes out as zero, and I can't see why.
I found only 1 other similar question (Scikit-learn R2 always zero) but applying the answer given there isn't helping me, so I'm not sure where I'm going wrong.
What I have is this:
df["Score"] = 0.75
Y = df["Score"]
df_valid = df.drop(columns=["Score"])
y_pred = model.predict(df_valid) # model is a RandomForestRegressor from sklearn
prediction = np.array(y_pred)
training = np.array(Y)
print(prediction)
print(training)
[0.77279743 0.18169051 0.81874664 0.75440987 0.67748983 0.56747803
0.66120282 0.5829188 0.73471978 0.57745964 0.48272321 0.65313173
0.805028 0.63791055 0.49677642 0.64341235 0.55456506 0.52329214
0.67690119 0.79450821 0.63378986 0.69522612 0.69802982 0.6719472
0.67977281 0.29016943 0.56192242 0.16265814 0.57813068 0.72598279
0.50255597 0.77138968 0.53745061 0.527479 0.67161703 0.64326146
0.5299367 0.79977403 0.73527391 0.50858258 0.74660319 0.72315073
0.71879784 0.55134538 0.61812615 0.64722909 0.67055658 0.68687499
0.73416035 0.4781765 0.74878142 0.5773583 ]
[0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75
0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75
0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75
0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75]
Both prediction and training are numpy arrays of the same shape. Am I missing something else?
When I try print(r2_score(training, prediction)), it gives me 0.
Upvotes: 0
Views: 2947
Reputation: 46908
R-squared is basically the proportion of variance explained by the model. You can see this in the first line of the Wikipedia article:
In statistics, the coefficient of determination, denoted R2 or r2 and pronounced "R squared", is the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
Your actual (observed) values consist of only one value, so there is no variance to explain. Why do you want to measure R^2?
You might be trying to measure something else, such as how well your model predicts rows that share a similar observed value, but taking R^2 on this subset does not make sense.
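If the goal is to see how far the predictions are from the constant target, an error metric is one option, since it does not depend on the variance of the observed values. The sketch below is only an illustration: it uses the first six predictions from the question, and `mae_manual` and `rmse_manual` are hypothetical helper names written for this example. As far as I know, `sklearn.metrics.mean_absolute_error` computes the same quantity as the first helper.

```python
import math

# First six predictions from the question; the observed target is always 0.75.
y_pred = [0.77279743, 0.18169051, 0.81874664, 0.75440987, 0.67748983, 0.56747803]
y_true = [0.75] * len(y_pred)


def mae_manual(y_true, y_pred):
    """Mean absolute error: average distance between target and prediction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse_manual(y_true, y_pred):
    """Root mean squared error: penalises large misses more heavily than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


mae = mae_manual(y_true, y_pred)
rmse = rmse_manual(y_true, y_pred)
print("MAE:", mae)
print("RMSE:", rmse)
```

Both numbers stay well defined when every observed value is identical, which is exactly the case where R^2 breaks down.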
Upvotes: 2
Reputation: 2042
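scikit-learn's r2_score returns 0 when y_true is always the same value (unless the predictions are perfect), and R2 is also 0 when y_predicted is constant and equal to the mean of y_true. In your case, y_true is always the same value.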
Going deeper into the formula, R2 is calculated as:
R2 = 1 - SSres / SStot
where SSres = sum((y_true - y_predicted)^2), and SStot is calculated as:
SStot = sum((y_true - ymean)^2)
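In your case, y_true - ymean is always 0 (0.75 - 0.75 = 0), so SStot is 0 and calculating R2 means dividing by 0. R2 is undefined here, and scikit-learn reports 0 instead of raising an error.
On the other hand, if y_predicted is always the same value and that value equals the mean of y_true, SSres and SStot are the same, so R2 is also 0. A constant prediction that differs from the mean makes SSres larger than SStot, and R2 becomes negative.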
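The two sums of squares can be checked by hand. This is a minimal sketch in plain Python, not scikit-learn's implementation: `r2_manual` is a hypothetical helper that returns None when SStot is 0, and the toy values [1, 2, 3, 4] are made up for illustration.

```python
def r2_manual(y_true, y_pred):
    """Compute R2 = 1 - SSres / SStot, or None when SStot is 0 (undefined)."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    if ss_tot == 0:
        # Constant y_true: the ratio would divide by zero.
        return None
    return 1 - ss_res / ss_tot


# Constant target, as in the question: SStot is 0, so R2 is undefined.
r2_constant_target = r2_manual([0.75] * 4, [0.7, 0.8, 0.75, 0.6])

# Varying target, constant prediction equal to the mean of y_true (2.5).
r2_mean_pred = r2_manual([1, 2, 3, 4], [2.5] * 4)

# Varying target, constant prediction that is not the mean.
r2_other_pred = r2_manual([1, 2, 3, 4], [3] * 4)

print(r2_constant_target, r2_mean_pred, r2_other_pred)
```

With a constant target the helper returns None, the mean prediction gives exactly 0, and the off-mean constant prediction gives a negative value.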
See this link for more information on how R2 is calculated; it is explained well there.
Upvotes: 4