Reputation: 5648
I'm using Python and Numpy to calculate a best fit polynomial of arbitrary degree. I pass a list of x values, y values, and the degree of the polynomial I want to fit (linear, quadratic, etc.).
This much works, but I also want to calculate r (coefficient of correlation) and r-squared(coefficient of determination). I am comparing my results with Excel's best-fit trendline capability, and the r-squared value it calculates. Using this, I know I am calculating r-squared correctly for linear best-fit (degree equals 1). However, my function does not work for polynomials with degree greater than 1.
Excel is able to do this. How do I calculate r-squared for higher-order polynomials using Numpy?
Here's my function:
import numpy
# Polynomial Regression
def polyfit(x, y, degree):
results = {}
coeffs = numpy.polyfit(x, y, degree)
# Polynomial Coefficients
results['polynomial'] = coeffs.tolist()
correlation = numpy.corrcoef(x, y)[0,1]
# r
results['correlation'] = correlation
# r-squared
results['determination'] = correlation**2
return results
Upvotes: 141
Views: 483907
Reputation: 813
you can do it directly in numpy==1.26 without any extra libraries by passing full=True
:
np.polyfit(xs, ys, 1, full=True)
Example:
# make 5 (x, y) points with fake variation
xs = np.linspace(0, 1, 5)
ys = np.array([0, 0.22, 0.55, 0.70, 1])
# ss_r: residuals – sum of squared residuals of the least squares fit.
# ss_r will be 1D array if `ys` is 1D.
(m, b), ss_r, _, _, _ = np.polyfit(xs, ys, 1, full=True)
m, b, ss_r
(0.9920000000000001 -0.001999999999999821 array([0.00568]))
# calculate line via `polyfit` then total sum of squares
# y = mx + b
r_ys = xs*m + b
ss_tot = np.sum((r_ys - np.mean(r_ys)) ** 2)
r_squared = 1 - (ss_r / ss_tot)
r_squared
array([0.99630472])
Upvotes: 0
Reputation: 475
I have been using this successfully, where x and y are array-like.
Note: for linear regression only
def rsquared(x, y):
""" Return R^2 where x and y are array-like."""
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y)
return r_value**2
Upvotes: 36
Reputation: 6665
Using the numpy module (tested in python3):
import numpy as np
def linear_regression(x, y):
coefs = np.polynomial.polynomial.polyfit(x, y, 1)
ffit = np.poly1d(coefs)
m = ffit[0]
b = ffit[1]
eq = 'y = {}x + {}'.format(round(m, 3), round(b, 3))
rsquared = np.corrcoef(x, y)[0, 1]**2
return rsquared, eq, m, b
rsquared, eq, m, b = linear_regression(x,y)
print(rsquared, m, b)
print(eq)
Output:
0.013378252355751777 0.1316331351105754 0.7928782850418713 y = 0.132x + 0.793
Note: r² ≠ R²
r² is called the "Coefficient of Determination"
R² is the square of the Pearson Coefficient
R², officially conflated as r², is probably the one you want, as it's a least-square fit, which is better than the simple fraction of sums that r² is. Numpy is not afraid to call it "corrcoef", which presupposes Pearson is the de-facto correlation coefficient.
Upvotes: 3
Reputation: 2471
I originally posted the benchmarks below with the purpose of recommending numpy.corrcoef
, foolishly not realizing that the original question already uses corrcoef
and was in fact asking about higher order polynomial fits. I've added an actual solution to the polynomial r-squared question using statsmodels, and I've left the original benchmarks, which while off-topic, are potentially useful to someone.
statsmodels
has the capability to calculate the r^2
of a polynomial fit directly, here are 2 methods...
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Construct the columns for the different powers of x
def get_r2_statsmodels(x, y, k=1):
xpoly = np.column_stack([x**i for i in range(k+1)])
return sm.OLS(y, xpoly).fit().rsquared
# Use the formula API and construct a formula describing the polynomial
def get_r2_statsmodels_formula(x, y, k=1):
formula = 'y ~ 1 + ' + ' + '.join('I(x**{})'.format(i) for i in range(1, k+1))
data = {'x': x, 'y': y}
return smf.ols(formula, data).fit().rsquared # or rsquared_adj
To further take advantage of statsmodels
, one should also look at the fitted model summary, which can be printed or displayed as a rich HTML table in Jupyter/IPython notebook. The results object provides access to many useful statistical metrics in addition to rsquared
.
model = sm.OLS(y, xpoly)
results = model.fit()
results.summary()
Below is my original Answer where I benchmarked various linear regression r^2 methods...
The corrcoef function used in the Question calculates the correlation coefficient, r
, only for a single linear regression, so it doesn't address the question of r^2
for higher order polynomial fits. However, for what it's worth, I've come to find that for linear regression, it is indeed the fastest and most direct method of calculating r
.
def get_r2_numpy_corrcoef(x, y):
return np.corrcoef(x, y)[0, 1]**2
These were my timeit results from comparing a bunch of methods for 1000 random (x, y) points:
r
calculation)
r
calculation)
r
calculation)
r
as an output)
The corrcoef method narrowly beats calculating the r^2 "manually" using numpy methods. It is >5X faster than the polyfit method and ~12X faster than the scipy.linregress. Just to reinforce what numpy is doing for you, it's 28X faster than pure python. I'm not well-versed in things like numba and pypy, so someone else would have to fill those gaps, but I think this is plenty convincing to me that corrcoef
is the best tool for calculating r
for a simple linear regression.
Here's my benchmarking code. I copy-pasted from a Jupyter Notebook (hard not to call it an IPython Notebook...), so I apologize if anything broke on the way. The %timeit magic command requires IPython.
import numpy as np
from scipy import stats
import statsmodels.api as sm
import math
n=1000
x = np.random.rand(1000)*10
x.sort()
y = 10 * x + (5+np.random.randn(1000)*10-5)
x_list = list(x)
y_list = list(y)
def get_r2_numpy(x, y):
slope, intercept = np.polyfit(x, y, 1)
r_squared = 1 - (sum((y - (slope * x + intercept))**2) / ((len(y) - 1) * np.var(y, ddof=1)))
return r_squared
def get_r2_scipy(x, y):
_, _, r_value, _, _ = stats.linregress(x, y)
return r_value**2
def get_r2_statsmodels(x, y):
return sm.OLS(y, sm.add_constant(x)).fit().rsquared
def get_r2_python(x_list, y_list):
n = len(x_list)
x_bar = sum(x_list)/n
y_bar = sum(y_list)/n
x_std = math.sqrt(sum([(xi-x_bar)**2 for xi in x_list])/(n-1))
y_std = math.sqrt(sum([(yi-y_bar)**2 for yi in y_list])/(n-1))
zx = [(xi-x_bar)/x_std for xi in x_list]
zy = [(yi-y_bar)/y_std for yi in y_list]
r = sum(zxi*zyi for zxi, zyi in zip(zx, zy))/(n-1)
return r**2
def get_r2_numpy_manual(x, y):
zx = (x-np.mean(x))/np.std(x, ddof=1)
zy = (y-np.mean(y))/np.std(y, ddof=1)
r = np.sum(zx*zy)/(len(x)-1)
return r**2
def get_r2_numpy_corrcoef(x, y):
return np.corrcoef(x, y)[0, 1]**2
print('Python')
%timeit get_r2_python(x_list, y_list)
print('Numpy polyfit')
%timeit get_r2_numpy(x, y)
print('Numpy Manual')
%timeit get_r2_numpy_manual(x, y)
print('Numpy corrcoef')
%timeit get_r2_numpy_corrcoef(x, y)
print('Scipy')
%timeit get_r2_scipy(x, y)
print('Statsmodels')
%timeit get_r2_statsmodels(x, y)
7/28/21 Benchmark results. (Python 3.7, numpy 1.19, scipy 1.6, statsmodels 0.12)
Python
2.41 ms ± 180 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Numpy polyfit
318 µs ± 44.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Numpy Manual
79.3 µs ± 4.05 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Numpy corrcoef
83.8 µs ± 1.37 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Scipy
221 µs ± 7.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Statsmodels
375 µs ± 3.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Upvotes: 30
Reputation: 9302
R-squared is a statistic that only applies to linear regression.
Essentially, it measures how much variation in your data can be explained by the linear regression.
So, you calculate the "Total Sum of Squares", which is the total squared deviation of each of your outcome variables from their mean. . .
where y_bar is the mean of the y's.
Then, you calculate the "regression sum of squares", which is how much your FITTED values differ from the mean
and find the ratio of those two.
Now, all you would have to do for a polynomial fit is plug in the y_hat's from that model, but it's not accurate to call that r-squared.
Here is a link I found that speaks to it a little.
Upvotes: 6
Reputation: 799
You can execute this code directly, this will find you the polynomial, and will find you the R-value you can put a comment down below if you need more explanation.
from scipy.stats import linregress
import numpy as np
x = np.array([1,2,3,4,5,6])
y = np.array([2,3,5,6,7,8])
p3 = np.polyfit(x,y,3) # 3rd degree polynomial, you can change it to any degree you want
xp = np.linspace(1,6,6) # 6 means the length of the line
poly_arr = np.polyval(p3,xp)
poly_list = [round(num, 3) for num in list(poly_arr)]
slope, intercept, r_value, p_value, std_err = linregress(x, poly_list)
print(r_value**2)
Upvotes: 1
Reputation: 20256
Here's a very simple python function to compute R^2 from the actual and predicted values assuming y and y_hat are pandas series:
def r_squared(y, y_hat):
y_bar = y.mean()
ss_tot = ((y-y_bar)**2).sum()
ss_res = ((y-y_hat)**2).sum()
return 1 - (ss_res/ss_tot)
Upvotes: 10
Reputation: 20373
From yanl (yet-another-library) sklearn.metrics
has an r2_score
function;
from sklearn.metrics import r2_score
coefficient_of_dermination = r2_score(y, p(x))
Upvotes: 90
Reputation:
From scipy.stats.linregress source. They use the average sum of squares method.
import numpy as np
x = np.array(x)
y = np.array(y)
# average sum of squares:
ssxm, ssxym, ssyxm, ssym = np.cov(x, y, bias=1).flat
r_num = ssxym
r_den = np.sqrt(ssxm * ssym)
r = r_num / r_den
if r_den == 0.0:
r = 0.0
else:
r = r_num / r_den
if r > 1.0:
r = 1.0
elif r < -1.0:
r = -1.0
Upvotes: 0
Reputation: 8492
A very late reply, but just in case someone needs a ready function for this:
i.e.
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(x, y)
as in @Adam Marples's answer.
Upvotes: 209
Reputation: 83427
Here is a function to compute the weighted r-squared with Python and Numpy (most of the code comes from sklearn):
from __future__ import division
import numpy as np
def compute_r2_weighted(y_true, y_pred, weight):
sse = (weight * (y_true - y_pred) ** 2).sum(axis=0, dtype=np.float64)
tse = (weight * (y_true - np.average(
y_true, axis=0, weights=weight)) ** 2).sum(axis=0, dtype=np.float64)
r2_score = 1 - (sse / tse)
return r2_score, sse, tse
Example:
from __future__ import print_function, division
import sklearn.metrics
def compute_r2_weighted(y_true, y_pred, weight):
sse = (weight * (y_true - y_pred) ** 2).sum(axis=0, dtype=np.float64)
tse = (weight * (y_true - np.average(
y_true, axis=0, weights=weight)) ** 2).sum(axis=0, dtype=np.float64)
r2_score = 1 - (sse / tse)
return r2_score, sse, tse
def compute_r2(y_true, y_predicted):
sse = sum((y_true - y_predicted)**2)
tse = (len(y_true) - 1) * np.var(y_true, ddof=1)
r2_score = 1 - (sse / tse)
return r2_score, sse, tse
def main():
'''
Demonstrate the use of compute_r2_weighted() and checks the results against sklearn
'''
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
weight = [1, 5, 1, 2]
r2_score = sklearn.metrics.r2_score(y_true, y_pred)
print('r2_score: {0}'.format(r2_score))
r2_score,_,_ = compute_r2(np.array(y_true), np.array(y_pred))
print('r2_score: {0}'.format(r2_score))
r2_score = sklearn.metrics.r2_score(y_true, y_pred,weight)
print('r2_score weighted: {0}'.format(r2_score))
r2_score,_,_ = compute_r2_weighted(np.array(y_true), np.array(y_pred), np.array(weight))
print('r2_score weighted: {0}'.format(r2_score))
if __name__ == "__main__":
main()
#cProfile.run('main()') # if you want to do some profiling
outputs:
r2_score: 0.9486081370449679
r2_score: 0.9486081370449679
r2_score weighted: 0.9573170731707317
r2_score weighted: 0.9573170731707317
This corresponds to the formula (mirror):
with f_i is the predicted value from the fit, y_{av} is the mean of the observed data y_i is the observed data value. w_i is the weighting applied to each data point, usually w_i=1. SSE is the sum of squares due to error and SST is the total sum of squares.
If interested, the code in R: https://gist.github.com/dhimmel/588d64a73fa4fef02c8f (mirror)
Upvotes: 10
Reputation: 3043
From the numpy.polyfit documentation, it is fitting linear regression. Specifically, numpy.polyfit with degree 'd' fits a linear regression with the mean function
E(y|x) = p_d * x**d + p_{d-1} * x **(d-1) + ... + p_1 * x + p_0
So you just need to calculate the R-squared for that fit. The wikipedia page on linear regression gives full details. You are interested in R^2 which you can calculate in a couple of ways, the easisest probably being
SST = Sum(i=1..n) (y_i - y_bar)^2
SSReg = Sum(i=1..n) (y_ihat - y_bar)^2
Rsquared = SSReg/SST
Where I use 'y_bar' for the mean of the y's, and 'y_ihat' to be the fit value for each point.
I'm not terribly familiar with numpy (I usually work in R), so there is probably a tidier way to calculate your R-squared, but the following should be correct
import numpy
# Polynomial Regression
def polyfit(x, y, degree):
results = {}
coeffs = numpy.polyfit(x, y, degree)
# Polynomial Coefficients
results['polynomial'] = coeffs.tolist()
# r-squared
p = numpy.poly1d(coeffs)
# fit values, and mean
yhat = p(x) # or [p(z) for z in x]
ybar = numpy.sum(y)/len(y) # or sum(y)/len(y)
ssreg = numpy.sum((yhat-ybar)**2) # or sum([ (yihat - ybar)**2 for yihat in yhat])
sstot = numpy.sum((y - ybar)**2) # or sum([ (yi - ybar)**2 for yi in y])
results['determination'] = ssreg / sstot
return results
Upvotes: 84
Reputation:
The wikipedia article on r-squareds suggests that it may be used for general model fitting rather than just linear regression.
Upvotes: 5