Reputation: 5629
I'm currently trying to implement confidence intervals on some parameters with the bootstrap method. However, I have a problem: even with 3000 bootstrap samples, my confidence intervals vary a lot from run to run.
Here is the situation:
I have a dataset of about 300 points, defined in the traditional way y=f(x). I know the model which fits the data. So I find the parameters with curve_fit, and I try to establish the confidence interval for each of them. I tried to mix the methods described here:
confidence interval with leastsq fit in scipy python
and here:
http://www.variousconsequences.com/2010/02/visualizing-confidence-intervals.html
Here is the code I use:
import numpy as np
from scipy.optimize import curve_fit

def model(t, Vs, Vi, k):
    """
    Fitting model, following burst kinetics.
    t is the time
    Vs is the steady-state velocity
    Vi is the initial velocity
    k is the burst rate constant
    """
    y = Vs * t - ((Vs - Vi) * (1 - np.exp(-k * t)) / k)
    return y
[some code]
bootindex = np.random.randint  # np.random.random_integers is deprecated; randint's upper bound is exclusive
nboot = 3000
local_t = np.array(local_t)
local_fluo = np.array(local_fluo)
concentration = np.array(concentration)
# Rescale time values from seconds to hours
local_scaled_t = [index / 3600 for index in local_t]
local_scaled_t = np.array(local_scaled_t)
conc_produit = [concentration[0] - value_conc for value_conc in concentration]
conc_produit = np.array(conc_produit)
popt, pcov = curve_fit(model, local_scaled_t, conc_produit, maxfev=3000)
# Convert the fitted parameters back to per-second units
popt = [popt[0] / 3600, popt[1] / 3600, popt[2] / 3600]
ymod = list()
for each in local_t:
    ymod.append(model(each, popt[0], popt[1], popt[2]))
ymod = np.array(ymod)
r = conc_produit - ymod  # residuals of the original fit
list_para = list()
# loop over nboot bootstrap samples drawn from the residuals
for i in range(nboot):
    pc, pout = curve_fit(model, local_scaled_t, ymod + r[bootindex(0, len(r), len(r))], maxfev=3000)
    pc = [pc[0] / 3600, pc[1] / 3600, pc[2] / 3600]
    list_para.append(pc)
    # rebuild the model predictions from this bootstrap fit; the next
    # iteration resamples residuals around these new predictions
    ymod = list()
    for each in local_t:
        ymod.append(model(each, pc[0], pc[1], pc[2]))
    ymod = np.array(ymod)
list_para = np.array(list_para)
mean_params = np.mean(list_para, axis=0)
std_params = np.std(list_para, axis=0)
print(popt)
for true_para, para, std in zip(popt, mean_params, std_params):
    print("{0} between {1} and {2}".format(round(true_para, 6),
                                           round(para - std * 1.95996, 6),
                                           round(para + std * 1.95996, 6)))
print("{0} between {1} and {2}".format(round(true_para, 6), round(para - std * 1.95996, 6), round(para + std * 1.95996, 6)))
Nothing complicated here; just note that I rescale the time to normalize my data and get better-behaved parameters.
And finally, here are two outputs from two runs of the same code:
[1.9023455671995163e-05, 0.01275941716148471, 0.026540319119773129]
1.9e-05 between 1.6e-05 and 2.1e-05
0.012759 between -0.042697 and 0.092152
0.02654 between -0.073456 and 0.159983
[1.9023455671995163e-05, 0.01275941716148471, 0.026540319119773129]
1.9e-05 between 1.5e-05 and 2.9e-05
0.012759 between -0.116499 and 0.17112
0.02654 between -0.186011 and 0.27797
As you can see, the differences are pretty huge. Is that expected, or am I doing something wrong? For example, I don't really understand why I have to multiply the standard deviation by 1.95996.
Upvotes: 4
Views: 4317
Reputation: 54340
Your curve_fit call already gave you the covariance matrix: that is pout. The 95% confidence limits for your ith parameter are pc[i] - 1.95996*sqrt(pout[i,i]) and pc[i] + 1.95996*sqrt(pout[i,i]). 1.95996 is the x such that the cumulative distribution function of the standard normal distribution satisfies F(x) = 0.975; that is why the standard deviation gets multiplied by it. You can get confidence intervals at other levels using scipy.stats.norm.ppf. See wiki: http://en.wikipedia.org/wiki/1.96
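For example, a minimal sketch of this covariance-based interval (assuming popt and pcov are the values returned by a curve_fit call like yours):

import numpy as np
from scipy.stats import norm

z = norm.ppf(0.975)  # the x with F(x) = 0.975, i.e. about 1.95996

# one-sigma uncertainties are the square roots of the covariance diagonal
perr = np.sqrt(np.diag(pcov))
for p, err in zip(popt, perr):
    print("{0} between {1} and {2}".format(p, p - z * err, p + z * err))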
Bootstrap is not going to give you the same (or, sometimes, even close) answers each time you run it. For your specific function, the very few early data points have a very big influence on the fit (see Solve equation with a set of points). I am not sure bootstrap is the way to go here: if those few early data points are not sampled, the fit will be very different from the fit of your original data. That also explains why your bootstrap intervals are so different from each other.
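If you do stick with the bootstrap, one common alternative to mean ± 1.95996*std is to read the 95% interval directly off the percentiles of the bootstrap distribution; a minimal sketch, assuming list_para is the (nboot, 3) array built in your loop:

lower = np.percentile(list_para, 2.5, axis=0)   # 2.5th percentile of each parameter
upper = np.percentile(list_para, 97.5, axis=0)  # 97.5th percentile of each parameter
for lo, hi in zip(lower, upper):
    print("between {0} and {1}".format(lo, hi))

This avoids the normality assumption behind the 1.95996 factor, but the intervals will still fluctuate between runs unless you fix the random seed (e.g. with np.random.seed).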
Upvotes: 2