Reputation: 568
The curve is:
import numpy as np
import scipy.stats as sp
from scipy.optimize import curve_fit
from lmfit import minimize, Parameters, Parameter, report_fit
import xlwings as xw
import os
import pandas as pd
I tried running a simple curve fit from scipy. This returns:
Out[156]:
(array([ 1., 1.]), array([[ inf, inf],
[ inf, inf]]))
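A minimal sketch of such a call, assuming the clipped model described in the answers below (the original snippet is not shown): with curve_fit's default start of (1, 1) the exponential stays far below the lower clip value for every x, so the model is a constant, the Jacobian is zero, and the start values come back unchanged with an all-inf covariance.
import numpy as np
import scipy.stats as sp
from scipy.optimize import curve_fit

def func(x, n1, n2):
    # clipped model, as described in the answers below
    return np.clip(np.exp(n2 + n1*sp.norm.ppf(x)), 25e4, 10e6)

x_data = np.array((1e-04, 9e-01, 9.5835e-01, 9.8e-01, 9.9e-01, 9.9e-01))
y_data = np.array((250e3, 1e6, 2.5e6, 5e6, 7.5e6, 10e6))

popt, pcov = curve_fit(func, x_data, y_data)  # default p0 is (1, 1)
print(popt, pcov)  # -> [1. 1.] and an all-inf covariance matrix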
OK, so I thought I needed to start bounding my parameters, which is what Solver does. I used lmfit to do this:
params = Parameters()
params.add
I tried to nudge Python in the right direction by changing the starting parameters:
or by using curve_fit:
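A sketch of what attempts like these typically look like, assuming the non-clipped model; the bounds and starting values here are illustrative guesses, not the values from the post:
import numpy as np
import scipy.stats as sp
from scipy.optimize import curve_fit
from lmfit import minimize, Parameters

x_data = np.array((1e-04, 9e-01, 9.5835e-01, 9.8e-01, 9.9e-01, 9.9e-01))
y_data = np.array((250e3, 1e6, 2.5e6, 5e6, 7.5e6, 10e6))

def residual(params, x, data):
    # y = exp(n2 + n1*ppf(x)), compared against the data
    n1 = params['n1'].value
    n2 = params['n2'].value
    return np.exp(n2 + n1*sp.norm.ppf(x)) - data

# lmfit: bounded parameters with hand-picked starting values (illustrative)
params = Parameters()
params.add('n1', value=2, min=0, max=20)
params.add('n2', value=10, min=0, max=20)
result = minimize(residual, params, args=(x_data, y_data))

# scipy: the same model with an explicit starting guess (illustrative)
def func(x, n1, n2):
    return np.exp(n2 + n1*sp.norm.ppf(x))

popt, pcov = curve_fit(func, x_data, y_data, p0=(2, 10))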
Upvotes: 4
Views: 1839
Reputation: 53119
The curve fitting runs smoothly when we provide a good starting point. We can get one by running a linear regression of np.log(y_data) against sp.norm.ppf(x_data), or by first fitting the free (non-clipped) version of the formula. Alternatively, if you want the computer to find the solution without "help", you can use basin hopping or a brute-force grid search.
All four methods yield essentially the same result, and it is better than the Excel result.
import numpy as np
import scipy.stats as sp
from scipy.optimize import curve_fit, basinhopping, brute
def func1(x, n1, n2):
    return np.clip(np.exp(n2 + n1*(sp.norm.ppf(x))), 25e4, 10e6)

def func_free(x, n1, n2):
    return np.exp(n2 + n1*(sp.norm.ppf(x)))

def sqerr(n12, x, y):
    return ((func1(x, *n12) - y)**2).sum()
x_data = np.array((1e-04,9e-01,9.5835e-01,9.8e-01,9.9e-01,9.9e-01))
y_data = np.array((250e3,1e6,2.5e6,5e6,7.5e6,10e6))
# get a good starting point
# either by linear regression
lin = sp.linregress(sp.norm.ppf(x_data), np.log(y_data))
# or by using the free (non-clipped) version of the formula
(n1f, n2f), infof = curve_fit(func_free, x_data, y_data, (1, 1))
# use those on the original problem
(n1, n2), info = curve_fit(func1, x_data, y_data, (lin.slope, lin.intercept))
(n12, n22), info2 = curve_fit(func1, x_data, y_data, (n1f, n2f))
# OR
# use basin hopping
hop = basinhopping(sqerr, (1, 1), minimizer_kwargs=dict(args=(x_data, y_data)), stepsize=10)
# OR
# brute force it
brt = brute(sqerr, ((-100, 100), (-100, 100)), (x_data, y_data), 201, full_output=True)
# all four solutions are essentially the same:
assert np.allclose((n1, n2), (n12, n22))
assert np.allclose((n1, n2), hop.x)
assert np.allclose((n1, n2), brt[0])
# we are actually a bit better than excel
n1excel, n2excel = 1.7925, 11.6771
print('solution', n1, n2)
print('error', ((func1(x_data, n1, n2) - y_data)**2).sum())
print('excel', ((func1(x_data, n1excel, n2excel) - y_data)**2).sum())
Output:
solution 2.08286042997 11.1397332743
error 3.12796761241e+12
excel 5.80088578059e+12
Remark: One easy optimization - which I left out for simplicity and because things are fast enough anyway - would have been to pull sp.norm.ppf out of the model function. This is possible because it does not depend on the fit parameters. So when any of our solvers calls the function, it always does the exact same computation - sp.norm.ppf(x_data) - first, so we might as well have precomputed it. This observation is also our reason for using sp.norm.ppf(x_data) in the linear regression.
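A sketch of that optimization as a variation on the code above (my addition, not part of the original answer): compute the ppf once and fit in the transformed coordinate.
z_data = sp.norm.ppf(x_data)   # computed once, outside the model function

def func1_z(z, n1, n2):
    # same clipped model as func1, but taking the precomputed ppf values
    return np.clip(np.exp(n2 + n1*z), 25e4, 10e6)

lin = sp.linregress(z_data, np.log(y_data))
(n1z, n2z), infoz = curve_fit(func1_z, z_data, y_data, (lin.slope, lin.intercept))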
Upvotes: 1
Reputation: 44615
Here is a simple way to perform regression using scipy.optimize.curve_fit:
import matplotlib.pyplot as plt
import scipy.optimize as opt
import scipy.stats as stats
import numpy as np
%matplotlib inline
# Objective
def model(x, n1, n2):
    return np.exp(n2 + n1*(stats.norm.ppf(x)))
# Data
x_samp = np.array((1e-04, 9e-01, 9.5835e-01, 9.8e-01, 9.9e-01, 9.9e-01))
y_samp = np.array((250e3, 1e6, 2.5e6, 5e6, 7.5e6, 10e6))
x_lin = np.linspace(min(x_samp), max(x_samp), 50) # for fitting
# Regression
p0 = [5, 5] # guessed params
w, cov = opt.curve_fit(model, x_samp, y_samp, p0=p0)
print("Estimated Parameters", w)
y_fit = model(x_lin, *w)
# Visualization
plt.plot(x_samp, y_samp, "ko", label="Data")
plt.plot(x_lin, y_fit, "k--", label="Fit")
plt.title("Curve Fitting")
plt.legend(loc="upper left")
Output
Estimated Parameters [ 2.08285048 11.13975585]
Details
Your data was plotted as-is, without transformations. It is easiest to perform such a regression when the model and initial parameters, p0, reasonably fit your data. When these are supplied to scipy.optimize.curve_fit, a tuple of the weights (the optimized parameter estimates), w, is returned along with a covariance matrix. We can compute one standard deviation for each parameter from the diagonal of that matrix:
p_stdev = np.sqrt(np.diag(cov))
print("Std. Dev. of Params:", p_stdev)
# Std. Dev. of Params: [ 0.42281111 0.95945127]
We visually assess the goodness of fit by supplying these estimated parameters once more to the model and plotting the fitted line.
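To attach a number to that visual check, the residual sum of squares can be computed from the same objects defined above (a small added sketch, not part of the original answer):
residuals = y_samp - model(x_samp, *w)   # data minus fitted model
print("Sum of squared residuals:", (residuals**2).sum())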
Upvotes: 0
Reputation: 7862
I think that part of the problem is that you have only 5 observations, 2 at the same value of x, and the model does not perfectly represent your data. I also recommend fitting the log of the model to the log of the data. And, if you expect n2 to be ~10, you should use that as a starting value.
Arbitrarily applying min and max to the calculated model, especially with values so close to your data range, makes very little sense. Doing this will prevent the fit method from exploring the effect of changing parameter values on the model function.
With some modifications to your code to omit the first data point (which seems to be throwing off the model), I find this:
import numpy as np
import scipy.stats as sp
from lmfit import Parameters, minimize, fit_report
import matplotlib.pyplot as plt
x_data = np.array((1e-04,9e-01,9.5835e-01,9.8e-01,9.9e-01,9.9e-01))
y_data = np.array((250e3,1e6,2.5e6,5e6,7.5e6,10e6))
x_data = x_data[1:]
y_data = y_data[1:]
def func(params, x, data, m1=250e3, m2=10e6):
    n1 = params['n1'].value
    n2 = params['n2'].value
    model = n2 + n1*(sp.norm.ppf(x))
    return np.log(data) - model
params = Parameters()
params.add('n1', value= 1, min=0.01, max=20)
params.add('n2', value= 10, min=0, max=20)
result = minimize(func, params, args=(x_data[1:-1], y_data[1:-1]))
print(fit_report(result))
n1 = result.params['n1'].value
n2 = result.params['n2'].value
ym = np.exp(n2 + n1*(sp.norm.ppf(x_data)))
plt.plot(x_data, y_data, 'o')
plt.plot(x_data, ym, '-')
plt.legend(['data', 'fit'])
plt.show()
gives a report of
[[Fit Statistics]]
# fitting method = leastsq
# function evals = 15
# data points = 3
# variables = 2
chi-square = 0.006
reduced chi-square = 0.006
Akaike info crit = -14.438
Bayesian info crit = -16.241
[[Variables]]
n1: 1.85709072 +/- 0.190473 (10.26%) (init= 1)
n2: 11.5455736 +/- 0.390805 (3.38%) (init= 10)
[[Correlations]] (unreported correlations are < 0.100)
C(n1, n2) = -0.993
Upvotes: 1