fransua

Reputation: 1608

Linear regression ODR fails

Following the recommendations in this answer, I have tried several combinations of values for beta0 and, as shown here, the values from polyfit.

This example is UPDATED to show the effect of the relative scales of X versus Y (the X range is 0.1 to 100 times the Y range):

from random import random, seed
from scipy import odr
import numpy as np
from matplotlib import pyplot as plt

seed(1)
X = np.array([random() for i in range(1000)])
Y = np.array([i + random()**2 for i in range(1000)])

for num in range(1, 5):
    plt.subplot(2, 2, num)
    plt.title('X range is %.1f times Y' % (float(100 / max(X))))
    X *= 10
    z = np.polyfit(X, Y, 1)
    plt.plot(X, Y, 'k.', alpha=0.1)

    # Fit using odr
    def f(B, X):  # linear model: y = B[0]*X + B[1]
        return B[0]*X + B[1]

    linear = odr.Model(f)
    mydata = odr.RealData(X, Y)
    myodr = odr.ODR(mydata, linear, beta0=z)
    myodr.set_job(fit_type=0)  # fit_type=0: explicit orthogonal distance regression
    myoutput = myodr.run()
    a, b = myoutput.beta
    sa, sb = myoutput.sd_beta
    xp = np.linspace(plt.xlim()[0], plt.xlim()[1], 1000)
    yp = a*xp+b
    plt.plot(xp, yp, label='ODR')
    yp2 = z[0]*xp+z[1]
    plt.plot(xp, yp2, label='polyfit')
    plt.legend()
    plt.ylim(-1000, 2000)
plt.show()

It seems that no combination of beta0 helps... The only ways to get polyfit and the ODR fit to agree are to swap X and Y (see the sketch after the figure below), or, as shown here, to increase the range of X relative to Y; still not really a solution :)

[figure: new example]
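For the record, the swap observation can be checked with a minimal sketch of my own (not from the question, but using data shaped like the question's): with the axes swapped, the cloud's long axis is horizontal and the two fits come out similar.

import numpy as np
from scipy import odr

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, 1000)                        # spread ~ 1
Y = np.arange(1000) + rng.uniform(0, 1, 1000)**2   # spread ~ 1000

model = odr.Model(lambda B, x: B[0] * x + B[1])
z = np.polyfit(Y, X, 1)                            # axes swapped: fit X as a function of Y
out = odr.ODR(odr.RealData(Y, X), model, beta0=z).run()
print('polyfit slope:', z[0])                      # tiny slope
print('ODR slope:', out.beta[0])                   # now close to polyfit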

=== EDIT ===

I do not want ODR to be the same as polyfit. I am showing polyfit only to emphasize that the ODR fit is wrong and that the problem is not the data.

=== SOLUTION ===

Thanks to @norok2's answer, here is the fix in action when the Y range is 0.001 to 100000 times the X range:

from random import random, seed
from scipy import odr
import numpy as np
from matplotlib import pyplot as plt
seed(1)
X = np.array([random() / 1000 for i in range(1000)])
Y = np.array([i + random()**2 for i in range(1000)])
plt.figure(figsize=(12, 12))
for num in range(1, 10):
    plt.subplot(3, 3, num)
    plt.title('Y range is %.1f times X' % (float(100 / max(X))))
    X *= 10
    z = np.polyfit(X, Y, 1)
    plt.plot(X, Y, 'k.', alpha=0.1)
    # Fit using odr
    def f(B, X):  # linear model: y = B[0]*X + B[1]
        return B[0]*X + B[1]
    linear = odr.Model(f)
    mydata = odr.RealData(X, Y, 
                          sy=min(1/np.var(Y), 1/np.var(X)))  # here the trick!! :)
    myodr = odr.ODR(mydata, linear, beta0=z)
    myodr.set_job(fit_type=0)  # fit_type=0: explicit orthogonal distance regression
    myoutput = myodr.run()
    a, b = myoutput.beta
    sa, sb = myoutput.sd_beta
    xp = np.linspace(plt.xlim()[0], plt.xlim()[1], 1000)
    yp = a*xp+b
    plt.plot(xp, yp, label='ODR')
    yp2 = z[0]*xp+z[1]
    plt.plot(xp, yp2, label='polyfit')

    plt.legend()
    plt.ylim(-1000, 2000)
plt.show()

[figure: example3]

Upvotes: 5

Views: 1697

Answers (2)

norok2

Reputation: 26906

The key difference between polyfit() and the Orthogonal Distance Regression (ODR) fit is that polyfit() works under the assumption that the error on x is negligible. If this assumption is violated, as it is in your data, you cannot expect the two methods to produce similar results. In particular, ODR() is very sensitive to the errors you specify. If you do not specify any error/weighting, it assigns a value of 1 to both x and y (scipy converts the standard deviations sx and sy into the weights wd = 1/sx**2 and we = 1/sy**2), meaning that any scale difference between x and y will affect the results (the so-called numerical conditioning).
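To see this concretely, here is a minimal sketch of my own that mimics the data in the question: Y spans roughly 0 to 1000 while X spans 0 to 1, and with no explicit errors ODR weighs both axes equally, so it tracks the cloud's long (vertical) axis rather than the polyfit line.

import numpy as np
from scipy import odr

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, 1000)                        # spread ~ 1
Y = np.arange(1000) + rng.uniform(0, 1, 1000)**2   # spread ~ 1000, unrelated to X

z = np.polyfit(X, Y, 1)
model = odr.Model(lambda B, x: B[0] * x + B[1])
out = odr.ODR(odr.RealData(X, Y), model, beta0=z).run()  # implicit sx = sy = 1

print('polyfit slope:', z[0])        # small: there is no real X-Y relation
print('ODR slope:', out.beta[0])     # typically far larger: a near-vertical fit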

By contrast, polyfit(), before computing the fit, applies some pre-whitening to the data (see around line 577 of its source code) for better numerical conditioning.
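That scaling step looks roughly like this (a paraphrase of the np.polyfit source, not a verbatim copy; polyfit1_scaled is a made-up name):

import numpy as np

def polyfit1_scaled(x, y):
    # each column of the Vandermonde matrix is normalized to unit Euclidean
    # norm before the least-squares solve, then the scaling is undone
    lhs = np.vander(x, 2)                     # columns: [x, 1]
    scale = np.sqrt((lhs * lhs).sum(axis=0))  # per-column norms
    coef, *_ = np.linalg.lstsq(lhs / scale, y, rcond=None)
    return coef / scale                       # undo the scaling

# agrees with np.polyfit(x, y, 1) to floating-point precision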

Therefore, if you want ODR() to match polyfit(), you could simply fine-tune the error on Y to change the numerical conditioning of the problem. Change:

mydata = odr.RealData(X, Y)
# equivalent to: odr.RealData(X, Y, sx=1, sy=1)

to:

mydata = odr.RealData(X, Y, sx=1, sy=1/np.var(Y))


I tested that this works for any numerical conditioning of your Y between 1e-10 and 1e10 (in your example it is / 10., i.e. 1e-1).

Note that this would only make sense for well-conditioned fits.
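One way to reproduce that kind of test (again a sketch of mine, with data shaped like the question's) is to sweep sy over several orders of magnitude and compare the resulting slopes:

import numpy as np
from scipy import odr

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, 1000)
Y = np.arange(1000) + rng.uniform(0, 1, 1000)**2

z = np.polyfit(X, Y, 1)
model = odr.Model(lambda B, x: B[0] * x + B[1])

for sy in (1e-10, 1.0 / np.var(Y), 1e-1, 1.0, 1e10):
    data = odr.RealData(X, Y, sy=sy)  # sy is converted to the weight we = 1/sy**2
    out = odr.ODR(data, model, beta0=z).run()
    print('sy=%9.3e  ODR slope=%10.4f  polyfit slope=%10.4f' % (sy, out.beta[0], z[0]))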

Upvotes: 3

James Phillips

Reputation: 4657

I cannot format source code in a comment, so I am placing it here. This code uses ODR to calculate fit statistics; note the line commented "parameter order for odr": a wrapper function adapts my "actual" function f to the (beta, x) argument order that ODR expects.

from scipy.optimize import curve_fit
import numpy as np
import scipy.odr
import scipy.stats

x = np.array([5.357, 5.797, 5.936, 6.161, 6.697, 6.731, 6.775, 8.442, 9.861])
y = np.array([0.376, 0.874, 1.049, 1.327, 2.054, 2.077, 2.138, 4.744, 7.104])

def f(x, b0, b1):
    return b0 + (b1 * x)


def f_wrapper_for_odr(beta, x): # parameter order for odr
    return f(x, *beta)

parameters, cov = curve_fit(f, x, y)

model = scipy.odr.odrpack.Model(f_wrapper_for_odr)
data = scipy.odr.odrpack.Data(x, y)
# maxit=0 performs no iterations: ODR only evaluates the statistics at beta0
myodr = scipy.odr.odrpack.ODR(data, model, beta0=parameters, maxit=0)
myodr.set_job(fit_type=2)  # fit_type=2 selects ordinary least squares
parameterStatistics = myodr.run()
df_e = len(x) - len(parameters) # degrees of freedom, error
cov_beta = parameterStatistics.cov_beta # parameter covariance matrix from ODR
sd_beta = parameterStatistics.sd_beta * parameterStatistics.sd_beta # squared standard errors (parameter variances)
ci = []
t_df = scipy.stats.t.ppf(0.975, df_e) # two-sided 95% critical t value
for i in range(len(parameters)):
    ci.append([parameters[i] - t_df * parameterStatistics.sd_beta[i], parameters[i] + t_df * parameterStatistics.sd_beta[i]])

tstat_beta = parameters / parameterStatistics.sd_beta # coeff t-statistics
pstat_beta = (1.0 - scipy.stats.t.cdf(np.abs(tstat_beta), df_e)) * 2.0    # coef. p-values

for i in range(len(parameters)):
    print('parameter:', parameters[i])
    print('   conf interval:', ci[i][0], ci[i][1])
    print('   tstat:', tstat_beta[i])
    print('   pstat:', pstat_beta[i])
    print()
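As a cross-check (my addition, not part of this answer), the same standard errors can be obtained directly from the covariance matrix that curve_fit returns; for this linear model they should be close to parameterStatistics.sd_beta:

from scipy.optimize import curve_fit
import numpy as np

x = np.array([5.357, 5.797, 5.936, 6.161, 6.697, 6.731, 6.775, 8.442, 9.861])
y = np.array([0.376, 0.874, 1.049, 1.327, 2.054, 2.077, 2.138, 4.744, 7.104])

def f(x, b0, b1):
    return b0 + (b1 * x)

parameters, cov = curve_fit(f, x, y)
perr = np.sqrt(np.diag(cov))  # standard errors = sqrt of the covariance diagonal
print('curve_fit std errors:', perr)
# compare with parameterStatistics.sd_beta from the script above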

Upvotes: 1
