fransua

Reputation: 1608

Linear regression ODR fails

Following the recommendations in this answer, I have tried several combinations of values for beta0 and, as shown here, the values from polyfit.

This example is UPDATED to show the effect of the relative scales of X versus Y (the X range is 0.1 to 100 times the Y range):

from random import random, seed
from scipy import odr
import numpy as np
from matplotlib import pyplot as plt

seed(1)
X = np.array([random() for i in range(1000)])
Y = np.array([i + random()**2 for i in range(1000)])

for num in range(1, 5):
    plt.subplot(2, 2, num)
    plt.title('X range is %.1f times Y' % (float(100 / max(X))))
    X *= 10
    z = np.polyfit(X, Y, 1)
    plt.plot(X, Y, 'k.', alpha=0.1)

    # Fit using odr
    def f(B, X):  # linear model: y = B[0]*X + B[1]
        return B[0]*X + B[1]

    linear = odr.Model(f)
    mydata = odr.RealData(X, Y)
    myodr = odr.ODR(mydata, linear, beta0=z)
    myodr.set_job(fit_type=0)  # fit_type=0: explicit orthogonal distance regression
    myoutput = myodr.run()
    a, b = myoutput.beta
    sa, sb = myoutput.sd_beta
    xp = np.linspace(plt.xlim()[0], plt.xlim()[1], 1000)
    yp = a*xp+b
    plt.plot(xp, yp, label='ODR')
    yp2 = z[0]*xp+z[1]
    plt.plot(xp, yp2, label='polyfit')
    plt.legend()
    plt.ylim(-1000, 2000)
plt.show()

It seems that no combination of beta0 helps... The only ways to get polyfit and the ODR fit to agree are to swap X and Y (see the sketch after the figure below), or, as shown here, to increase the range of X relative to Y; still not really a solution :)

[figure: new example]
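For the record, the swap observation can be checked with a minimal sketch of my own (not from the question, but using data shaped like the question's): with the axes swapped, the cloud's long axis is horizontal and the two fits come out similar.

import numpy as np
from scipy import odr

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, 1000)                        # spread ~ 1
Y = np.arange(1000) + rng.uniform(0, 1, 1000)**2   # spread ~ 1000

model = odr.Model(lambda B, x: B[0] * x + B[1])
z = np.polyfit(Y, X, 1)                            # axes swapped: fit X as a function of Y
out = odr.ODR(odr.RealData(Y, X), model, beta0=z).run()
print('polyfit slope:', z[0])                      # tiny slope
print('ODR slope:', out.beta[0])                   # now close to polyfit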

=== EDIT ===

I do not want ODR to be the same as polyfit. I am showing polyfit only to emphasize that the ODR fit is wrong and that the problem is not the data.

=== SOLUTION ===

Thanks to @norok2's answer, here is the fix in action when the Y range is 0.001 to 100000 times the X range:

from random import random, seed
from scipy import odr
import numpy as np
from matplotlib import pyplot as plt
seed(1)
X = np.array([random() / 1000 for i in range(1000)])
Y = np.array([i + random()**2 for i in range(1000)])
plt.figure(figsize=(12, 12))
for num in range(1, 10):
    plt.subplot(3, 3, num)
    plt.title('Y range is %.1f times X' % (float(100 / max(X))))
    X *= 10
    z = np.polyfit(X, Y, 1)
    plt.plot(X, Y, 'k.', alpha=0.1)
    # Fit using odr
    def f(B, X):  # linear model: y = B[0]*X + B[1]
        return B[0]*X + B[1]
    linear = odr.Model(f)
    mydata = odr.RealData(X, Y, 
                          sy=min(1/np.var(Y), 1/np.var(X)))  # here the trick!! :)
    myodr = odr.ODR(mydata, linear, beta0=z)
    myodr.set_job(fit_type=0)  # fit_type=0: explicit orthogonal distance regression
    myoutput = myodr.run()
    a, b = myoutput.beta
    sa, sb = myoutput.sd_beta
    xp = np.linspace(plt.xlim()[0], plt.xlim()[1], 1000)
    yp = a*xp+b
    plt.plot(xp, yp, label='ODR')
    yp2 = z[0]*xp+z[1]
    plt.plot(xp, yp2, label='polyfit')

    plt.legend()
    plt.ylim(-1000, 2000)
plt.show()

[figure: example3]

Upvotes: 5

Views: 1697

Answers (2)

norok2

Reputation: 26906

The key difference between polyfit() and the Orthogonal Distance Regression (ODR) fit is that polyfit() works under the assumption that the error on x is negligible. If this assumption is violated, as it is in your data, you cannot expect the two methods to produce similar results. In particular, ODR() is very sensitive to the errors you specify. If you do not specify any error/weighting, it assigns a value of 1 to both x and y (scipy converts the standard deviations sx and sy into the weights wd = 1/sx**2 and we = 1/sy**2), meaning that any scale difference between x and y will affect the results (the so-called numerical conditioning).
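To see this concretely, here is a minimal sketch of my own that mimics the data in the question: Y spans roughly 0 to 1000 while X spans 0 to 1, and with no explicit errors ODR weighs both axes equally, so it tracks the cloud's long (vertical) axis rather than the polyfit line.

import numpy as np
from scipy import odr

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, 1000)                        # spread ~ 1
Y = np.arange(1000) + rng.uniform(0, 1, 1000)**2   # spread ~ 1000, unrelated to X

z = np.polyfit(X, Y, 1)
model = odr.Model(lambda B, x: B[0] * x + B[1])
out = odr.ODR(odr.RealData(X, Y), model, beta0=z).run()  # implicit sx = sy = 1

print('polyfit slope:', z[0])        # small: there is no real X-Y relation
print('ODR slope:', out.beta[0])     # typically far larger: a near-vertical fit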

By contrast, polyfit(), before computing the fit, applies some pre-whitening to the data (see around line 577 of its source code) for better numerical conditioning.
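That scaling step looks roughly like this (a paraphrase of the np.polyfit source, not a verbatim copy; polyfit1_scaled is a made-up name):

import numpy as np

def polyfit1_scaled(x, y):
    # each column of the Vandermonde matrix is normalized to unit Euclidean
    # norm before the least-squares solve, then the scaling is undone
    lhs = np.vander(x, 2)                     # columns: [x, 1]
    scale = np.sqrt((lhs * lhs).sum(axis=0))  # per-column norms
    coef, *_ = np.linalg.lstsq(lhs / scale, y, rcond=None)
    return coef / scale                       # undo the scaling

# agrees with np.polyfit(x, y, 1) to floating-point precision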

Therefore, if you want ODR() to match polyfit(), you could simply fine-tune the error on Y to change the numerical conditioning of the problem. Change:

mydata = odr.RealData(X, Y)
# equivalent to: odr.RealData(X, Y, sx=1, sy=1)

to:

mydata = odr.RealData(X, Y, sx=1, sy=1/np.var(Y))


I tested that this works for any numerical conditioning of your Y between 1e-10 and 1e10 (in your example it is / 10., i.e. 1e-1).

Note that this would only make sense for well-conditioned fits.
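One way to reproduce that kind of test (again a sketch of mine, with data shaped like the question's) is to sweep sy over several orders of magnitude and compare the resulting slopes:

import numpy as np
from scipy import odr

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, 1000)
Y = np.arange(1000) + rng.uniform(0, 1, 1000)**2

z = np.polyfit(X, Y, 1)
model = odr.Model(lambda B, x: B[0] * x + B[1])

for sy in (1e-10, 1.0 / np.var(Y), 1e-1, 1.0, 1e10):
    data = odr.RealData(X, Y, sy=sy)  # sy is converted to the weight we = 1/sy**2
    out = odr.ODR(data, model, beta0=z).run()
    print('sy=%9.3e  ODR slope=%10.4f  polyfit slope=%10.4f' % (sy, out.beta[0], z[0]))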

Upvotes: 3

James Phillips

Reputation: 4657

I cannot format source code in a comment, so I am placing it here. This code uses ODR to calculate fit statistics; note the line commented "parameter order for odr": a wrapper function adapts my "actual" function f to the (beta, x) argument order that ODR expects.

from scipy.optimize import curve_fit
import numpy as np
import scipy.odr
import scipy.stats

x = np.array([5.357, 5.797, 5.936, 6.161, 6.697, 6.731, 6.775, 8.442, 9.861])
y = np.array([0.376, 0.874, 1.049, 1.327, 2.054, 2.077, 2.138, 4.744, 7.104])

def f(x, b0, b1):
    return b0 + (b1 * x)


def f_wrapper_for_odr(beta, x): # parameter order for odr
    return f(x, *beta)

parameters, cov = curve_fit(f, x, y)

model = scipy.odr.odrpack.Model(f_wrapper_for_odr)
data = scipy.odr.odrpack.Data(x, y)
# maxit=0 performs no iterations: ODR only evaluates the statistics at beta0
myodr = scipy.odr.odrpack.ODR(data, model, beta0=parameters, maxit=0)
myodr.set_job(fit_type=2)  # fit_type=2 selects ordinary least squares
parameterStatistics = myodr.run()
df_e = len(x) - len(parameters) # degrees of freedom, error
cov_beta = parameterStatistics.cov_beta # parameter covariance matrix from ODR
sd_beta = parameterStatistics.sd_beta * parameterStatistics.sd_beta # squared standard errors (parameter variances)
ci = []
t_df = scipy.stats.t.ppf(0.975, df_e) # two-sided 95% critical t value
for i in range(len(parameters)):
    ci.append([parameters[i] - t_df * parameterStatistics.sd_beta[i], parameters[i] + t_df * parameterStatistics.sd_beta[i]])

tstat_beta = parameters / parameterStatistics.sd_beta # coeff t-statistics
pstat_beta = (1.0 - scipy.stats.t.cdf(np.abs(tstat_beta), df_e)) * 2.0    # coef. p-values

for i in range(len(parameters)):
    print('parameter:', parameters[i])
    print('   conf interval:', ci[i][0], ci[i][1])
    print('   tstat:', tstat_beta[i])
    print('   pstat:', pstat_beta[i])
    print()
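As a cross-check (my addition, not part of this answer), the same standard errors can be obtained directly from the covariance matrix that curve_fit returns; for this linear model they should be close to parameterStatistics.sd_beta:

from scipy.optimize import curve_fit
import numpy as np

x = np.array([5.357, 5.797, 5.936, 6.161, 6.697, 6.731, 6.775, 8.442, 9.861])
y = np.array([0.376, 0.874, 1.049, 1.327, 2.054, 2.077, 2.138, 4.744, 7.104])

def f(x, b0, b1):
    return b0 + (b1 * x)

parameters, cov = curve_fit(f, x, y)
perr = np.sqrt(np.diag(cov))  # standard errors = sqrt of the covariance diagonal
print('curve_fit std errors:', perr)
# compare with parameterStatistics.sd_beta from the script above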

Upvotes: 1
