Reputation: 387
I'm trying to calculate the confidence interval for the mean using the bootstrap method in Python. Let's say I have a vector a with 100 entries, and my aim is to calculate the mean of these 100 values and its 95% confidence interval using the bootstrap. So far I have managed to resample 1000 times from my vector using the np.random.choice function. Then, for each bootstrap vector with 100 entries, I calculated the mean. So now I have 1000 bootstrap mean values and a single sample mean from my initial vector, but I'm not sure how to proceed from here. How can I use these mean values to find the confidence interval for the mean of my initial vector? I'm relatively new to Python and this is the first time I have come across the bootstrap method, so any help would be much appreciated.
Upvotes: 11
Views: 19456
Reputation: 57
I have a simple statistical solution: confidence intervals are based on the standard error. In your case, the standard error is the standard deviation of your 1000 bootstrap means. Assuming the sampling distribution of your parameter (the mean) is approximately normal, which the Central Limit Theorem should justify, multiply the z-score for the desired confidence level by that standard deviation. Therefore:
lower boundary = mean of your bootstrap means - 1.96 * std. dev. of your bootstrap means
upper boundary = mean of your bootstrap means + 1.96 * std. dev. of your bootstrap means
95% of cases in a normal distribution sit within 1.96 standard deviations from the mean
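The two boundary formulas above can be sketched in NumPy as follows. This is a minimal illustration rather than a definitive implementation: the data in `a` is randomly generated placeholder data, and the seed, the variable names, and the use of `np.random.default_rng` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
# Placeholder data standing in for the 100-entry vector `a` (assumption)
a = rng.normal(loc=10.0, scale=2.0, size=100)

n_boot = 1000
# Resample with replacement and record the mean of each bootstrap sample
boot_means = np.array([
    rng.choice(a, size=a.size, replace=True).mean()
    for _ in range(n_boot)
])

# The standard deviation of the bootstrap means estimates the standard error
se = boot_means.std(ddof=1)
center = boot_means.mean()

# Normal-approximation 95% interval: center +/- 1.96 * standard error
lower_normal = center - 1.96 * se
upper_normal = center + 1.96 * se
print(f"95% CI (normal approximation): [{lower_normal:.3f}, {upper_normal:.3f}]")
```

For a different confidence level, only the multiplier changes, for example roughly 1.645 for a 90% interval.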
hope this helps
Upvotes: 4
Reputation: 721
First, I suggest you deepen your understanding of the bootstrap method and its usage; the main idea is to handle a situation where you lack data and want to reproduce more of it.
Second, regarding the confidence interval, you can use the Wilson score interval, a confidence interval for a binomial proportion that can help you rank binomial models. I found this IPython notebook, which explains what you asked for.
A short example of the Wilson interval:
import math

def ci(positive, n, z):
    # Wilson score interval for `positive` successes out of `n` trials; z = 1.96 gives 95%
    phat = positive / n
    center = phat + z * z / (2 * n)
    margin = z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)
    denom = 1 + z * z / n
    return (center - margin) / denom, (center + margin) / denom

sample_size = [50, 100, 200, 400, 8000]
# Two-sided z-scores for each confidence level
z_rate_confidence = {'95%': 1.96, '90%': 1.645, '75%': 1.15}
success_rate = [0.6, 0.7, 0.8]
for confidence, z in z_rate_confidence.items():
    print('confidence: ' + confidence + '\n')
    for n in sample_size:
        print('sample size: ', n)
        for s in success_rate:
            print(ci(s * n, n, z))
Upvotes: -1
Reputation: 8781
You could sort the array of 1000 means and use the 50th and 950th elements as the 90% bootstrap confidence interval. For the 95% interval you asked about, use the 25th and 975th elements instead.
Your set of 1000 means is basically a sample of the distribution of the mean estimator (the sampling distribution of the mean). So, any operation you could do on a sample from a distribution you can do here.
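As a rough sketch of this percentile approach, the snippet below computes a 95% interval (the level asked for in the question) both with `np.percentile` and by indexing into the sorted means. The placeholder data, the seed, and the variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
# Placeholder data standing in for the 100-entry vector `a` (assumption)
a = rng.normal(loc=10.0, scale=2.0, size=100)

# Draw all 1000 resamples at once: each row of `idx` indexes one bootstrap sample
idx = rng.integers(0, a.size, size=(1000, a.size))
boot_means = a[idx].mean(axis=1)

# Percentile interval: the middle 95% of the bootstrap means
lower_pct, upper_pct = np.percentile(boot_means, [2.5, 97.5])

# Equivalent idea by sorting: the 25th and 975th elements (indices 24 and 974)
sorted_means = np.sort(boot_means)
lower_sorted, upper_sorted = sorted_means[24], sorted_means[974]

print(f"95% CI (np.percentile): [{lower_pct:.3f}, {upper_pct:.3f}]")
print(f"95% CI (sorted elements): [{lower_sorted:.3f}, {upper_sorted:.3f}]")
```

The two results differ slightly because `np.percentile` interpolates between neighbouring sorted values by default. Passing `[5, 95]` instead gives the 90% interval.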
Upvotes: 9