DN1

Reputation: 218

Why is scikit-learn r2 score zero?

I have a training dataset where all the Y values are 0.75, and my model predicts a score for each row as a regression. When I calculate r2 it comes out as zero, and I can't see why.

I found only 1 other similar question (Scikit-learn R2 always zero) but applying the answer given there isn't helping me, so I'm not sure where I'm going wrong.

What I have is this:

import numpy as np

df["Score"] = 0.75
Y = df["Score"]
df_valid = df.drop(columns=["Score"])  # features only, target column removed

y_pred = model.predict(df_valid)  # model is a random forest regressor from sklearn

prediction = np.array(y_pred)
training = np.array(Y)

print(prediction)
print(training)


[0.77279743 0.18169051 0.81874664 0.75440987 0.67748983 0.56747803
 0.66120282 0.5829188  0.73471978 0.57745964 0.48272321 0.65313173
 0.805028   0.63791055 0.49677642 0.64341235 0.55456506 0.52329214
 0.67690119 0.79450821 0.63378986 0.69522612 0.69802982 0.6719472
 0.67977281 0.29016943 0.56192242 0.16265814 0.57813068 0.72598279
 0.50255597 0.77138968 0.53745061 0.527479   0.67161703 0.64326146
 0.5299367  0.79977403 0.73527391 0.50858258 0.74660319 0.72315073
 0.71879784 0.55134538 0.61812615 0.64722909 0.67055658 0.68687499
 0.73416035 0.4781765  0.74878142 0.5773583 ]
[0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75
 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75
 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75
 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75 0.75]

Both prediction and training are NumPy arrays of the same shape. Am I missing something else?

When I try print(r2_score(training, prediction)) it gives me 0.

Upvotes: 0

Views: 2947

Answers (2)

StupidWolf

Reputation: 46908

R-squared is basically the proportion of variance explained by the model; see the first line of the Wikipedia article:

In statistics, the coefficient of determination, denoted R² or r² and pronounced "R squared", is the proportion of the variance in the dependent variable that is predictable from the independent variable(s).

Your observed values consist of only one value, so there is no variance to speak of. Why would you want to measure R^2?

You might be trying to measure something else, such as how well your model predicts rows that share a similar observed value, but taking R^2 on this subset does not make sense.
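If the goal is to see how far the predictions are from a constant target, an error metric is one option, since it does not depend on the variance of the observed values. Here is a minimal sketch, assuming MAE and RMSE are acceptable for your use case; the arrays are illustrative only (the first five predictions from the question, rounded), not your real data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Illustrative values: a constant target and the first five predictions
# from the question, rounded to two decimals.
y_true = np.full(5, 0.75)
y_pred = np.array([0.77, 0.18, 0.82, 0.75, 0.68])

# Mean absolute error: average distance between prediction and target.
mae = mean_absolute_error(y_true, y_pred)

# Root mean squared error: penalises large misses (such as 0.18) more heavily.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"MAE:  {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
```

Both numbers stay well defined when every observed value is the same, which is not true of R^2.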

Upvotes: 2

Alex Serra Marrugat

Reputation: 2042

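scikit-learn's r2_score returns 0 when y_true is always the same value (unless every prediction matches it exactly). A constant y_predicted also gives 0 when that constant equals the mean of y_true. In your case, y_true is always the same.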

Looking at the formula, R2 is calculated as:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

And SStot is calculated as:

$$SS_{tot} = \sum_i (y_{true,i} - y_{mean})^2$$

In your case, every y_true - y_mean term is 0 (0.75 - 0.75 = 0), so SStot is 0 and calculating R2 would mean dividing by 0. scikit-learn does not raise an error here; it returns 0 instead.

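On the other hand, if y_predicted is always the same value and that value equals the mean of y_true, SSres and SStot are the same and R2 is also 0. Any other constant prediction makes SSres larger than SStot, so R2 becomes negative.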
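The cases above can be checked with a small sketch. `manual_r2` is a hypothetical helper written for this answer (it is not part of scikit-learn), and the arrays are made-up values. It computes SSres and SStot by hand and compares the result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score


def manual_r2(y_true, y_pred):
    """Hypothetical helper: return (ss_res, ss_tot, r2); r2 is None if ss_tot is 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = None if ss_tot == 0 else 1 - ss_res / ss_tot
    return ss_res, ss_tot, r2


# Case 1: constant y_true, as in the question. SStot is 0, so the
# formula is undefined; r2_score reports 0.0 rather than failing.
y_const = np.full(4, 0.75)
y_pred = np.array([0.77, 0.18, 0.82, 0.75])
ss_res_1, ss_tot_1, r2_1 = manual_r2(y_const, y_pred)
sk_1 = r2_score(y_const, y_pred)

# Case 2: varying y_true, constant prediction equal to its mean.
# SSres equals SStot, so R2 is 0.
y_var = np.array([0.5, 0.7, 0.8, 1.0])
sk_2 = r2_score(y_var, np.full(4, y_var.mean()))

# Case 3: varying y_true, constant prediction away from the mean.
# SSres exceeds SStot, so R2 is negative.
sk_3 = r2_score(y_var, np.full(4, 0.5))

print("constant y_true:   ", ss_tot_1, r2_1, sk_1)
print("constant at mean:  ", sk_2)
print("constant elsewhere:", sk_3)
```

In case 1 the manual calculation has no valid answer because the denominator is 0, which is why the 0 you see says nothing about how good the model is.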

Consult this link for more information on how to calculate R2; it is explained pretty well.

Upvotes: 4
