user33484

Reputation: 568

Curve fit - function failing

The curve is:

import numpy as np
import scipy.stats as sp
from scipy.optimize import curve_fit
from lmfit import minimize, Parameters, Parameter, report_fit
import xlwings as xw
import os
import pandas as pd

def func(x, n1, n2):
    # exp(n2 + n1 * probit(x)), clipped to the range [25e4, 10e6]
    return np.clip(np.exp(n2 + n1*sp.norm.ppf(x)), 25e4, 10e6)

x_data = np.array((1e-04, 9e-01, 9.5835e-01, 9.8e-01, 9.9e-01, 9.9e-01))
y_data = np.array((250e3, 1e6, 2.5e6, 5e6, 7.5e6, 10e6))

I tried running a simple curve fit from scipy:

curve_fit(func, x_data, y_data)

This returns

Out[156]: 
(array([ 1.,  1.]), array([[ inf,  inf],
        [ inf,  inf]]))

OK, so I thought I needed to start bounding my parameters, which is what Excel's Solver does. I used lmfit to do this:

params = Parameters()

params.add('n1', value=1, min=0.01, max=20)
params.add('n2', value=1, min=0, max=20)

I tried to nudge Python in the right direction by changing the starting parameters, and by passing an explicit initial guess to curve_fit, but without success.

Upvotes: 4

Views: 1839

Answers (3)

Paul Panzer

Reputation: 53119

The curve fitting runs smoothly when we provide a good starting point. We can get one:

  • by linear regression on sp.norm.ppf(x_data) and np.log(y_data)
  • or by fitting the free (non-clipped) model first

Alternatively, if you want the computer to find the solution without "help":

  • use a stochastic algorithm like basin hopping (it's unfortunately not 100% autonomous; I had to increase the step size from its default value)
  • brute force it (this requires the user to provide a search grid)

All four methods yield the same result, and it is better than the Excel result.

import numpy as np
import scipy.stats as sp
from scipy.optimize import curve_fit, basinhopping, brute

def func1(x, n1, n2):
    # clipped model: exp(n2 + n1 * probit(x)), bounded to [25e4, 10e6]
    return np.clip(np.exp(n2 + n1*sp.norm.ppf(x)), 25e4, 10e6)

def func_free(x, n1, n2):
    # the same model without the clipping
    return np.exp(n2 + n1*sp.norm.ppf(x))

def sqerr(n12, x, y):
    # sum of squared residuals, as a scalar objective for basinhopping/brute
    return ((func1(x, *n12) - y)**2).sum()

x_data = np.array((1e-04, 9e-01, 9.5835e-01, 9.8e-01, 9.9e-01, 9.9e-01))
y_data = np.array((250e3, 1e6, 2.5e6, 5e6, 7.5e6, 10e6))

# get a good starting point

# either by linear regression
lin = sp.linregress(sp.norm.ppf(x_data), np.log(y_data))

# or by using the free (non-clipped) version of the formula
(n1f, n2f), infof = curve_fit(func_free, x_data, y_data, (1, 1))

# use those on the original problem 
(n1, n2), info = curve_fit(func1, x_data, y_data, (lin.slope, lin.intercept))
(n12, n22), info2 = curve_fit(func1, x_data, y_data, (n1f, n2f))

# OR

# use basin hopping
hop = basinhopping(sqerr, (1, 1), minimizer_kwargs=dict(args=(x_data, y_data)), stepsize=10)

# OR 

# brute force it
brt = brute(sqerr, ((-100, 100), (-100, 100)), (x_data, y_data), 201, full_output=True)

# all four solutions are essentially the same:
assert np.allclose((n1, n2), (n12, n22))
assert np.allclose((n1, n2), hop.x)
assert np.allclose((n1, n2), brt[0])

# we are actually a bit better than excel
n1excel, n2excel = 1.7925, 11.6771

print('solution', n1, n2)
print('error', ((func1(x_data, n1, n2) - y_data)**2).sum())
print('excel', ((func1(x_data, n1excel, n2excel) - y_data)**2).sum())

Output:

solution 2.08286042997 11.1397332743
error 3.12796761241e+12
excel 5.80088578059e+12

Remark: One easy optimization, which I left out for simplicity and because things are fast enough anyway, would have been to pull sp.norm.ppf out of the model function. This is possible because it does not depend on the fit parameters, so every time one of our solvers calls the model it repeats the exact same computation, sp.norm.ppf(x_data); we might as well precompute it once.

This observation is also the reason for using sp.norm.ppf(x_data) in the linear regression.
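For illustration, a precomputed variant might look like the sketch below; func1_pre and the p-suffixed names are just for this example, everything else reuses the variables defined above.

ppf_data = sp.norm.ppf(x_data)   # computed once; independent of n1 and n2

def func1_pre(ppf_x, n1, n2):
    # same clipped model, but takes the precomputed probit values directly
    return np.clip(np.exp(n2 + n1*ppf_x), 25e4, 10e6)

(n1p, n2p), infop = curve_fit(func1_pre, ppf_data, y_data, (lin.slope, lin.intercept))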

Upvotes: 1

pylang

Reputation: 44615

Here is a simple way to perform regression using scipy.optimize.curve_fit:

import matplotlib.pyplot as plt
import scipy.optimize as opt
import scipy.stats as stats
import numpy as np

%matplotlib inline


# Objective
def model(x, n1, n2):
    return np.exp(n2 + n1*(stats.norm.ppf(x)))

# Data
x_samp = np.array((1e-04, 9e-01, 9.5835e-01, 9.8e-01, 9.9e-01, 9.9e-01))
y_samp = np.array((250e3, 1e6, 2.5e6, 5e6, 7.5e6, 10e6))

x_lin = np.linspace(min(x_samp), max(x_samp), 50)  # dense grid for plotting the fit

# Regression
p0 = [5, 5]                                        # guessed params
w, cov = opt.curve_fit(model, x_samp, y_samp, p0=p0)     
print("Estimated Parameters", w)  
y_fit = model(x_lin, *w)

# Visualization
plt.plot(x_samp, y_samp, "ko", label="Data")
plt.plot(x_lin, y_fit, "k--", label="Fit")
plt.title("Curve Fitting")
plt.legend(loc="upper left")

Output

Estimated Parameters [  2.08285048  11.13975585]

(plot: the sample data with the fitted curve)


Details

Your data was plotted as-is, without transformations. Such a regression is easiest when the model and the initial parameters p0 fit your data reasonably well. When these are supplied to scipy.optimize.curve_fit, it returns a tuple of the optimized parameters w (the weights) along with a covariance matrix. We can compute one standard deviation of the parameters from the diagonal of that matrix:

p_stdev = np.sqrt(np.diag(cov)) 
print("Std. Dev. of Params:", p_stdev)
# Std. Dev. of Params: [ 0.42281111  0.95945127]

We visually assess the goodness of fit by supplying these estimated parameters once more to the model and plotting the fitted line.
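If you want a number to go with the visual check, the residual sum of squares is easy to compute from the same pieces (a small sketch of mine, not part of the original snippet):

residuals = y_samp - model(x_samp, *w)
print("RSS:", np.sum(residuals**2))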

Upvotes: 0

M Newville

Reputation: 7862

I think part of the problem is that you have only six observations, two of them at the same value of x, and the model does not perfectly represent your data. I also recommend fitting the log of the model to the log of the data. And if you expect n2 to be ~10, you should use that as a starting value.

Arbitrarily applying min and max clipping to the calculated model, especially with bounds so close to your data range, makes very little sense: the clipping prevents the fit method from seeing any effect of changing the parameter values on the model function.
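To see why, here is a quick check (a sketch, assuming the clip bounds 25e4 and 10e6 from the question): with the default start of n1 = n2 = 1, every model value lands on the lower clip bound, so perturbing the parameters changes nothing, the solver sees a zero gradient, and you get exactly the (array([ 1., 1.]), inf covariance) result from the question.

import numpy as np
import scipy.stats as sp

def clipped(x, n1, n2):
    return np.clip(np.exp(n2 + n1*sp.norm.ppf(x)), 25e4, 10e6)

x = np.array((1e-04, 9e-01, 9.9e-01))
print(clipped(x, 1.0, 1.0))  # every value sits on the 25e4 clip floor
print(clipped(x, 1.1, 1.1))  # ...and stays there after perturbing the parameters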

With some modifications to your code to omit the first data point (which seems to be throwing off the model), I find this:

import numpy as np
import scipy.stats as sp
from lmfit import Parameters, minimize, fit_report

import matplotlib.pyplot as plt

x_data = np.array((1e-04,9e-01,9.5835e-01,9.8e-01,9.9e-01,9.9e-01))
y_data = np.array((250e3,1e6,2.5e6,5e6,7.5e6,10e6))

x_data = x_data[1:]
y_data = y_data[1:]

def func(params, x, data):
    n1 = params['n1'].value
    n2 = params['n2'].value

    # residual in log space: log(data) - (n2 + n1 * probit(x))
    model = n2 + n1*sp.norm.ppf(x)
    return np.log(data) - model

params = Parameters()

params.add('n1', value=1, min=0.01, max=20)
params.add('n2', value=10, min=0, max=20)

# note: this slices the already-sliced arrays again, so only the middle
# three points (x = 0.95835, 0.98, 0.99) actually enter the fit,
# matching "data points = 3" in the report below
result = minimize(func, params, args=(x_data[1:-1], y_data[1:-1]))

print(fit_report(result))
n1 = result.params['n1'].value
n2 = result.params['n2'].value 
ym = np.exp(n2 + n1*(sp.norm.ppf(x_data)))

plt.plot(x_data, y_data, 'o')
plt.plot(x_data, ym, '-')
plt.legend(['data', 'fit'])
plt.show()

gives a report of

[[Fit Statistics]]
    # fitting method   = leastsq
    # function evals   = 15
    # data points      = 3
    # variables        = 2
    chi-square         = 0.006
    reduced chi-square = 0.006
    Akaike info crit   = -14.438
    Bayesian info crit = -16.241
[[Variables]]
    n1:   1.85709072 +/- 0.190473 (10.26%) (init= 1)
    n2:   11.5455736 +/- 0.390805 (3.38%) (init= 10)
[[Correlations]] (unreported correlations are <  0.100)
    C(n1, n2)                    = -0.993

and a plot of the data with the fitted line.

Upvotes: 1
