Cerin

Reputation: 64739

Calculating the Coefficient of Determination in Python

I'm trying to calculate the coefficient of determination (R^2) in Python, but I'm getting a negative value in certain cases. Is this a sign that there's an error in my calculation? I thought R^2 should be bounded between 0 and 1.

Here's my Python code for the calculation, adapted directly from the Wikipedia article:

>>> yi_list = [1, 1, 63, 63, 5, 5, 124, 124]
>>> fi_list = [1.7438055421354988, 2.3153069186947639, 1002.7093097555808, 63.097699219524706, 6.2635465467410842, 7.2275532522971364, 17.55393551900103, 40.8570]
>>> y_mean = sum(yi_list)/float(len(yi_list))
>>> ss_tot = sum((yi-y_mean)**2 for yi in yi_list)
>>> ss_err = sum((yi-fi)**2 for yi,fi in zip(yi_list,fi_list))
>>> r2 = 1 - (ss_err/ss_tot)
>>> r2
-43.802085810924964

Upvotes: 3

Views: 5495

Answers (4)

Amjad

Reputation: 3678

Here is a function that calculates the coefficient of determination in Python:

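import numpy as np

def rSquare(estimations, measureds):
    """Compute the coefficient of determination (R^2).

    Compares model estimates against measured values. The result is 1 for a
    perfect match and is negative when the estimates fit worse than the mean."""
    estimations = np.asarray(estimations, dtype=float)
    measureds = np.asarray(measureds, dtype=float)
    # Residual sum of squares: measured values vs. estimates.
    SEE = ((measureds - estimations)**2).sum()
    # Total sum of squares: measured values vs. their mean.
    dErr = ((measureds - measureds.mean())**2).sum()

    return 1 - (SEE / dErr)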

Upvotes: 3

mb14

Reputation: 22596

No, there is no error in the formula. Your values are not correlated whatsoever (look at y3 and f3: 63 and 1002).

To see that R2 is not bounded between 0 and 1, imagine that one of the f values is near infinite. ss_err will be near infinite too, so R2 will be near minus infinity.

Are you perhaps confusing the X and Y values?

(Sorry for the "near infinite" bit, but I don't know how to say it better in English.)
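
A small sketch of that argument, reusing the formula from the question. The helper name `r_squared` and the toy values are made up for illustration. As a single prediction is pushed further from its observed value, ss_err grows without bound and R2 keeps falling:

```python
def r_squared(observed, predicted):
    """R^2 = 1 - ss_err / ss_tot, as in the question's code."""
    mean = sum(observed) / float(len(observed))
    ss_tot = sum((y - mean) ** 2 for y in observed)
    ss_err = sum((y - f) ** 2 for y, f in zip(observed, predicted))
    return 1 - ss_err / ss_tot

observed = [1.0, 2.0, 3.0, 4.0]

# Push one prediction further and further from its observed value (4.0).
for bad_value in (4.0, 40.0, 400.0, 4000.0):
    predicted = [1.0, 2.0, 3.0, bad_value]
    print(bad_value, r_squared(observed, predicted))
```

The first line shows a perfect fit (R2 of 1), and every later line is more negative than the one before it.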

Upvotes: 1

David Webb

Reputation: 193716

Your implementation of the calculation as shown in the Wikipedia article looks OK to me.

According to the Wikipedia article:

Values of R2 outside the range 0 to 1 can occur where it is used to measure the agreement between observed and modelled values and where the "modelled" values are not obtained by linear regression and depending on which formulation of R2 is used.

Looking at your data, the expected-modelled pair of 63 and 1002.7093097555808 is probably the main source of the large error.
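
To check that hunch, here is a rough sketch that breaks ss_err into per-point squared errors using the lists from the question. The names `squared_errors`, `worst` and `share` are mine, not from the question:

```python
yi_list = [1, 1, 63, 63, 5, 5, 124, 124]
fi_list = [1.7438055421354988, 2.3153069186947639, 1002.7093097555808,
           63.097699219524706, 6.2635465467410842, 7.2275532522971364,
           17.55393551900103, 40.8570]

# Squared error contributed by each observed/modelled pair.
squared_errors = [(yi - fi) ** 2 for yi, fi in zip(yi_list, fi_list)]
ss_err = sum(squared_errors)

# Index of the pair that contributes the most, and its share of ss_err.
worst = max(range(len(squared_errors)), key=lambda i: squared_errors[i])
share = squared_errors[worst] / ss_err

print(worst, yi_list[worst], fi_list[worst])
print("share of ss_err: %.3f" % share)
```

On this data the pair at index 2 (63 vs. 1002.7) dominates ss_err, and the two 124 pairs are the next largest contributors.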

Upvotes: 4

neil

Reputation: 3635

Looking at the article, I think this is expected behaviour given the input data. In the introduction it says:

Important cases where the computational definition of R2 can yield negative values, depending on the definition used, arise where the predictions which are being compared to the corresponding outcome have not derived from a model-fitting procedure using those data.

I can't see anything in the formulae that would mean it would always be in the range 0-1.
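
One way to see where the 0 comes from, sketched with made-up numbers: predicting the mean of the observations for every point makes ss_err equal to ss_tot, so R2 is exactly 0. A least-squares line with an intercept, scored on the same data it was fitted to, can never do worse than that baseline, while predictions that were not fitted to the data can. The helper names and the x/y values below are illustrative assumptions:

```python
def r_squared(observed, predicted):
    mean = sum(observed) / float(len(observed))
    ss_tot = sum((y - mean) ** 2 for y in observed)
    ss_err = sum((y - f) ** 2 for y, f in zip(observed, predicted))
    return 1 - ss_err / ss_tot

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope*x + intercept."""
    n = float(len(xs))
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0]

y_mean = sum(ys) / len(ys)
slope, intercept = fit_line(xs, ys)

r2_mean = r_squared(ys, [y_mean] * len(ys))                   # baseline: always predict the mean
r2_fit = r_squared(ys, [slope * x + intercept for x in xs])   # line fitted to this data
r2_arbitrary = r_squared(ys, [10.0] * len(ys))                # predictions not fitted at all

print(r2_mean, r2_fit, r2_arbitrary)
```

Here the baseline gives 0, the fitted line lands between 0 and 1, and the arbitrary predictions give a negative value.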

Upvotes: 1
