Reputation: 13
I am trying to use polyfit to fit a parabola to the set of data points in "data." My program is working for other data sets that I try, but will not work with my particular data set.
I have tried making sure the x-data is sorted properly in ascending order, and I have tried fitting the data in excel. The fit looks fine in excel.
data = [[0.16888549099999922, 7.127084528823611], [0.16888549199999922, 6.993992044491425], [0.16888549299999922, 6.866362061761786], [0.16888549399999922, 6.744197905413327], [0.16888549499999922, 6.627501951010496], [0.16888549599999922, 6.516275651493945], [0.16888549699999922, 6.41051952560987], [0.16888549799999922, 6.310233194927246], [0.16888549899999922, 6.215415356951307], [0.16888549999999922, 6.1260638293986895], [0.16888550099999922, 6.042175535068139], [0.16888550199999922, 5.963746518271748], [0.16888550299999922, 5.890771966813277], [0.16888550399999921, 5.823246197254835], [0.16888550499999921, 5.761162692791979], [0.1688855059999992, 5.704514090084273], [0.1688855069999992, 5.653292206179698], [0.1688855079999992, 5.6074880385927806], [0.1688855089999992, 5.567091780688852], [0.1688855099999992, 5.532092833944616], [0.1688855109999992, 5.502479813247546], [0.1688855119999992, 5.4782405647173915], [0.1688855129999992, 5.459362171880384], [0.1688855139999992, 5.4458309690410776], [0.1688855149999992, 5.437632552360041], [0.1688855159999992, 5.434751791325787], [0.1688855169999992, 5.43717284006055], [0.1688855179999992, 5.444879149278103], [0.1688855189999992, 5.457853477721287], [0.1688855199999992, 5.47607790389057], [0.1688855209999992, 5.4995338388052435], [0.1688855219999992, 5.52820203573491], [0.1688855229999992, 5.562062602591132], [0.1688855239999992, 5.601095018180885], [0.1688855249999992, 5.645278137030775], [0.1688855259999992, 5.694590207075108], [0.1688855269999992, 5.7490088760994285], [0.1688855279999992, 5.808511211230821], [0.1688855289999992, 5.873073702228439], [0.1688855299999992, 5.942672282488598], [0.1688855309999992, 6.017282332411738], [0.1688855319999992, 6.096878700665825], [0.1688855329999992, 6.1814357028181135], [0.1688855339999992, 6.27092714427376], [0.1688855349999992, 6.365326333275836], [0.1688855359999992, 6.464606084709255], [0.1688855369999992, 6.56873873137477], [0.1688855379999992, 6.677696151240223], 
[0.1688855389999992, 6.791449748897025], [0.1688855399999992, 6.909970506323868]]
import numpy as np
import matplotlib
matplotlib.use('TkAgg')  # select the backend before importing pyplot
import matplotlib.pyplot as plt
x = list()
y = list()
for i in data:
    x.append(i[0])
    y.append(i[1])
fit = np.polyfit(x, y, 2)
f = np.poly1d(fit)
plt.scatter(x,y)
plt.plot(x,f(x))
plt.xlim(0.16888549099999922 - 0.000000001,0.1688855399999992 + 0.000000001)
plt.show()
I am getting a fit that looks like a straight line, and polyfit raises a warning that the fit is poorly conditioned. I cannot figure out why this is happening, so I set up this small example that clearly demonstrates the issue. Can anybody help me with this?
Upvotes: 1
Views: 5890
Reputation: 879671
The difference between the x-values is very small:
In [40]: np.diff(x).max()
Out[40]: 9.999999994736442e-10
Some numerical recipes work better when the inputs are not extremely tiny. (For example, an algorithm which starts with a fixed step size of, say, 0.1 might be fine for most unit-sized data, but totally overshoot the optimal coefficients in your case.)
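A quick way to see the problem (a sketch using x-values synthesized to match the spacing in the question, not the exact data) is to compare the condition number of the Vandermonde matrix that polyfit solves against, before and after normalization:

```python
import numpy as np

# x-values with the same tiny spread as in the question: 50 points, 1e-9 apart.
x = 0.16888549099999922 + 1e-9 * np.arange(50)

# polyfit solves a least-squares problem on this Vandermonde matrix; a huge
# condition number means the fitted coefficients are extremely sensitive
# to rounding in the inputs.
print(np.linalg.cond(np.vander(x, 3)))             # enormous

x_normalized = (x - x.mean()) / x.std()
print(np.linalg.cond(np.vander(x_normalized, 3)))  # small, single digits
```

With three nearly identical columns, the raw Vandermonde matrix is numerically almost rank one, which is exactly what the "poorly conditioned" warning is about.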
If you normalize your data:
x = (x - x.mean())/x.std()
then you get a more sensible result:
data = [[0.16888549099999922, 7.127084528823611], [0.16888549199999922, 6.993992044491425], [0.16888549299999922, 6.866362061761786], [0.16888549399999922, 6.744197905413327], [0.16888549499999922, 6.627501951010496], [0.16888549599999922, 6.516275651493945], [0.16888549699999922, 6.41051952560987], [0.16888549799999922, 6.310233194927246], [0.16888549899999922, 6.215415356951307], [0.16888549999999922, 6.1260638293986895], [0.16888550099999922, 6.042175535068139], [0.16888550199999922, 5.963746518271748], [0.16888550299999922, 5.890771966813277], [0.16888550399999921, 5.823246197254835], [0.16888550499999921, 5.761162692791979], [0.1688855059999992, 5.704514090084273], [0.1688855069999992, 5.653292206179698], [0.1688855079999992, 5.6074880385927806], [0.1688855089999992, 5.567091780688852], [0.1688855099999992, 5.532092833944616], [0.1688855109999992, 5.502479813247546], [0.1688855119999992, 5.4782405647173915], [0.1688855129999992, 5.459362171880384], [0.1688855139999992, 5.4458309690410776], [0.1688855149999992, 5.437632552360041], [0.1688855159999992, 5.434751791325787], [0.1688855169999992, 5.43717284006055], [0.1688855179999992, 5.444879149278103], [0.1688855189999992, 5.457853477721287], [0.1688855199999992, 5.47607790389057], [0.1688855209999992, 5.4995338388052435], [0.1688855219999992, 5.52820203573491], [0.1688855229999992, 5.562062602591132], [0.1688855239999992, 5.601095018180885], [0.1688855249999992, 5.645278137030775], [0.1688855259999992, 5.694590207075108], [0.1688855269999992, 5.7490088760994285], [0.1688855279999992, 5.808511211230821], [0.1688855289999992, 5.873073702228439], [0.1688855299999992, 5.942672282488598], [0.1688855309999992, 6.017282332411738], [0.1688855319999992, 6.096878700665825], [0.1688855329999992, 6.1814357028181135], [0.1688855339999992, 6.27092714427376], [0.1688855349999992, 6.365326333275836], [0.1688855359999992, 6.464606084709255], [0.1688855369999992, 6.56873873137477], [0.1688855379999992, 6.677696151240223], 
[0.1688855389999992, 6.791449748897025], [0.1688855399999992, 6.909970506323868]]
import numpy as np
import matplotlib
matplotlib.use('TkAgg')  # select the backend before importing pyplot
import matplotlib.pyplot as plt
x, y = map(np.array, zip(*data))
x_normalized = (x - x.mean())/x.std()
fit_normalized = np.polyfit(x_normalized, y, 2)
f = np.poly1d(fit_normalized)
plt.scatter(x, y, marker='o', c='red')
plt.plot(x,f(x_normalized), c='black')
plt.xlim(0.16888549099999922 - 0.000000001,0.1688855399999992 + 0.000000001)
plt.show()
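As an aside, modern NumPy can do this scaling for you: numpy.polynomial.Polynomial.fit maps the x-data into the window [-1, 1] internally before fitting. A minimal sketch, with synthetic data shaped like the question's (the quadratic coefficients below are made up for illustration):

```python
import numpy as np
from numpy.polynomial import Polynomial

# Synthetic stand-in for the question's data: x spaced 1e-9 apart,
# y a quadratic in the standardized coordinate.
x = 0.16888549099999922 + 1e-9 * np.arange(50)
t = (x - x.mean()) / x.std()
y = 0.55 * t**2 - 0.056 * t + 5.44

# Polynomial.fit rescales x into the window [-1, 1] internally,
# so the least-squares problem it actually solves is well conditioned.
p = Polynomial.fit(x, y, 2)

# p can be evaluated directly at the original x-values:
print(np.allclose(p(x), y))    # the fit reproduces the data
```

p.convert() would express the coefficients in unscaled x units, but those are subject to the same blow-up the manual transformation runs into.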
The call fit_normalized = np.polyfit(x_normalized, y, 2) computes coefficients with respect to the normalized x-data. To find the coefficients with respect to the original data, do a side computation:
Let
m, s = x.mean(), x.std()
x_normalized = (x - m)/s
You can think of this as a coordinate transformation. Then
y = a * x_normalized**2 + b * x_normalized + c
y = a * ((x - m)/s)**2 + b * ((x - m)/s) + c
Now you can expand and collect terms to find the coefficients with respect to x. Or you could use a symbolic math package like sympy:
In [55]: import sympy as sym
In [57]: x, a, b, c, m, s = sym.symbols('x a b c m s')
In [104]: sym.poly(a * ((x - m)/s)**2 + b * ((x - m)/s) + c, x).coeffs()
Out[104]: [a/s**2, (-2*a*m + b*s)/s**2, (a*m**2 - b*m*s + c*s**2)/s**2]
which shows that
y = a/s**2 * x**2 + (-2*a*m + b*s)/s**2 * x + (a*m**2 - b*m*s + c*s**2)/s**2
Just to show that the above calculation leads to a sensible fit:
data = [[0.16888549099999922, 7.127084528823611], [0.16888549199999922, 6.993992044491425], [0.16888549299999922, 6.866362061761786], [0.16888549399999922, 6.744197905413327], [0.16888549499999922, 6.627501951010496], [0.16888549599999922, 6.516275651493945], [0.16888549699999922, 6.41051952560987], [0.16888549799999922, 6.310233194927246], [0.16888549899999922, 6.215415356951307], [0.16888549999999922, 6.1260638293986895], [0.16888550099999922, 6.042175535068139], [0.16888550199999922, 5.963746518271748], [0.16888550299999922, 5.890771966813277], [0.16888550399999921, 5.823246197254835], [0.16888550499999921, 5.761162692791979], [0.1688855059999992, 5.704514090084273], [0.1688855069999992, 5.653292206179698], [0.1688855079999992, 5.6074880385927806], [0.1688855089999992, 5.567091780688852], [0.1688855099999992, 5.532092833944616], [0.1688855109999992, 5.502479813247546], [0.1688855119999992, 5.4782405647173915], [0.1688855129999992, 5.459362171880384], [0.1688855139999992, 5.4458309690410776], [0.1688855149999992, 5.437632552360041], [0.1688855159999992, 5.434751791325787], [0.1688855169999992, 5.43717284006055], [0.1688855179999992, 5.444879149278103], [0.1688855189999992, 5.457853477721287], [0.1688855199999992, 5.47607790389057], [0.1688855209999992, 5.4995338388052435], [0.1688855219999992, 5.52820203573491], [0.1688855229999992, 5.562062602591132], [0.1688855239999992, 5.601095018180885], [0.1688855249999992, 5.645278137030775], [0.1688855259999992, 5.694590207075108], [0.1688855269999992, 5.7490088760994285], [0.1688855279999992, 5.808511211230821], [0.1688855289999992, 5.873073702228439], [0.1688855299999992, 5.942672282488598], [0.1688855309999992, 6.017282332411738], [0.1688855319999992, 6.096878700665825], [0.1688855329999992, 6.1814357028181135], [0.1688855339999992, 6.27092714427376], [0.1688855349999992, 6.365326333275836], [0.1688855359999992, 6.464606084709255], [0.1688855369999992, 6.56873873137477], [0.1688855379999992, 6.677696151240223], 
[0.1688855389999992, 6.791449748897025], [0.1688855399999992, 6.909970506323868]]
import numpy as np
import matplotlib
matplotlib.use('TkAgg')  # select the backend before importing pyplot
import matplotlib.pyplot as plt
x, y = map(np.array, zip(*data))
m, s = x.mean(), x.std()
x_normalized = (x - m)/s
fit_normalized = a, b, c = np.polyfit(x_normalized, y, 2)
fit = (a / s**2), (-2*a*m/s**2 + b/s), a*m**2/s**2 - b*m/s + c
print(fit_normalized)
# [ 0.54960506 -0.05561036 5.43651191]
print(fit)
# (2639159957611392.0, -891431783709882.0, 75274958487869.69)
f = np.poly1d(fit)
plt.scatter(x, y, marker='o', c='red')
plt.plot(x, f(x), c='black')
plt.xlim(0.16888549099999922 - 0.000000001,0.1688855399999992 + 0.000000001)
plt.show()
yields roughly the same plot. Note that s is very small, so division by s leads to rather big numbers, and small errors in s will lead to large errors in the coefficients. Also bear in mind that NumPy floating point arrays have dtypes with limited precision. So depending on your actual data set, the computed coefficients may not be very accurate. You may need to compute the coefficients yourself with an arbitrary-precision math package (like decimal or gmpy2) to do better, but then again, if your inputs are not known with high precision, high-precision computations on low-precision data are not going to help.
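For instance, here is a minimal sketch of the do-it-yourself route using the standard library's fractions module instead of decimal or gmpy2, since rational arithmetic is exact (the quad_fit_exact helper is hypothetical, and solving the 3x3 normal equations by Cramer's rule is chosen only for brevity):

```python
from fractions import Fraction

def quad_fit_exact(xs, ys):
    """Least-squares quadratic fit y = a*x**2 + b*x + c in exact arithmetic.

    Fraction(v) converts each input to its exact value, so no rounding
    occurs anywhere; the 3x3 normal equations are solved by Cramer's rule.
    """
    xs = [Fraction(v) for v in xs]
    ys = [Fraction(v) for v in ys]
    # Moments S[k] = sum(x**k) and right-hand sides T[k] = sum(y * x**k).
    S = [sum(x**k for x in xs) for k in range(5)]
    T = [sum(y * x**k for x, y in zip(xs, ys)) for k in range(3)]
    # Normal equations M @ [c, b, a] = T, with M[i][j] = S[i + j].
    M = [[S[i + j] for j in range(3)] for i in range(3)]

    def det3(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
              - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
              + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det3(M)
    coeffs = []
    for col in range(3):
        Mc = [row[:] for row in M]     # replace one column with T (Cramer)
        for i in range(3):
            Mc[i][col] = T[i]
        coeffs.append(det3(Mc) / d)
    c, b, a = coeffs
    return a, b, c

# Exact check: data generated from y = 2*x**2 + 3*x + 1 is recovered exactly.
xs = list(range(10))
ys = [2 * x * x + 3 * x + 1 for x in xs]
print(quad_fit_exact(xs, ys))
```

This sidesteps the conditioning issue entirely, but it is O(n) sums of growing rationals and therefore slow for large data sets, and, as noted above, it cannot add precision that the inputs never had.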
Upvotes: 8