Reputation: 2012
I have some data called variable
that I can simulate in the following way:
import pandas as pd
import numpy as np
from scipy.stats import bootstrap
random_values = np.random.uniform(low=0, high=10, size=10000)
df = pd.DataFrame({"variable": random_values})
I want to bin my data into 5 bins given by bins = [0, 2, 4, 6, 8, 10]
and compute error bars for each bin with some bootstrapping method, e.g. a confidence interval at the 95% level. I figured out that the cumbersome part is calculating the error bars. I could do it with scipy.stats.bootstrap
and then do
bootstrap(one_of_the_bins, my_statistic, confidence_level=0.95, method='percentile')
but that requires splitting my data into chunks according to the bins and looping over the chunks. So I wonder: is there a more convenient way to do this? Is there some functionality integrated into pandas for that? Or can I pass scipy.stats
my full data together with the bins, so that scipy does the calculation for all the bins at once? Thanks for any advice!
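For clarity, the chunk-and-loop approach I mean looks roughly like this (np.mean here is just a stand-in for my actual statistic, and the reduced n_resamples is only to keep it fast):

```python
import numpy as np
from scipy.stats import bootstrap

rng = np.random.default_rng(0)
values = rng.uniform(low=0, high=10, size=10000)
bins = [0, 2, 4, 6, 8, 10]

# Split the data into per-bin chunks and bootstrap each chunk separately.
intervals = []
for lo, hi in zip(bins[:-1], bins[1:]):
    chunk = values[(values >= lo) & (values < hi)]
    res = bootstrap((chunk,), np.mean, confidence_level=0.95,
                    method='percentile', n_resamples=999)
    intervals.append(res.confidence_interval)
```

This works, but it is exactly the manual looping I would like to avoid.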
Upvotes: 0
Views: 190
Reputation: 2012
I wrote the following code myself, and it does the job:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import bootstrap
data = np.random.uniform(low=0, high=10, size=10000)
bins = np.array([0, 2, 4, 6, 8, 10])

def my_statistic(data):
    # histogram counts per bin; reads the global `bins` defined above
    y, binEdges = np.histogram(data, bins=bins)
    return y

y, binEdges = np.histogram(data, bins=bins)
bincenters = 0.5 * (binEdges[1:] + binEdges[:-1])
width = abs(binEdges[1] - binEdges[0])
bootstrap_err = bootstrap((data,), my_statistic, confidence_level=0.68,
                          method='percentile')
# below we plot a density (normalized) histogram of the distribution with errorbars
plt.errorbar(bincenters, y / len(data) / width,
             yerr=[np.abs(y - bootstrap_err.confidence_interval[0]) / len(data) / width,
                   np.abs(bootstrap_err.confidence_interval[1] - y) / len(data) / width],
             marker='.', drawstyle='steps-mid')
plt.grid()
If you don't want a density-normalized histogram, just remove /len(data)/width
wherever it appears; this gives you absolute counts and absolute errors. Note that in this case, using bootstrapping with a confidence level of 0.68, the calculated error is virtually equal to the Poisson error (yerr=np.sqrt(y)
). At least that is what two tests of mine show.
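Such a test can be sketched as follows (fixed seed and reduced n_resamples for speed; the half-width of the 68% interval plays the role of the 1-sigma error):

```python
import numpy as np
from scipy.stats import bootstrap

rng = np.random.default_rng(1)
data = rng.uniform(low=0, high=10, size=10000)
bins = np.array([0, 2, 4, 6, 8, 10])

def my_statistic(sample):
    # histogram counts per bin
    y, _ = np.histogram(sample, bins=bins)
    return y

y, _ = np.histogram(data, bins=bins)
res = bootstrap((data,), my_statistic, confidence_level=0.68,
                method='percentile', n_resamples=999)

# Half-width of the 68% interval, to compare with the Poisson error sqrt(y)
half_width = (res.confidence_interval.high - res.confidence_interval.low) / 2
poisson_err = np.sqrt(y)
print(half_width)
print(poisson_err)
```

For this sample the two come out close. Strictly speaking the bootstrap reproduces multinomial rather than Poisson fluctuations, so the agreement is best when each bin holds only a small fraction of the data.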
Upvotes: 0
Reputation: 3803
scipy can do the calculation for all the bins at once:
import numpy as np
from scipy.stats import bootstrap, binned_statistic
x = np.random.uniform(low=0, high=10, size=10000)
def statistic(x):
    res = binned_statistic(x, x, statistic='mean',
                           bins=[0, 2, 4, 6, 8, 10])
    return res.statistic
res = bootstrap((x,), statistic)
res.confidence_interval
# ConfidenceInterval(low=array([0.99653325, 2.969743 , 4.99033544, 6.98312963, 8.9515727 ]),
# high=array([1.04843922, 3.02077679, 5.04092083, 7.03426957, 9.0015762 ]))
Note the theoretical difference between pre-binning the data and letting the bootstrap re-bin every resample. You can visualize the difference more easily if the statistic
parameter of binned_statistic
were 'count'
. Suppose your observed data is binned with the following counts per bin:
binned_statistic(x, x, statistic='count', bins=[0, 2, 4, 6, 8, 10])
# BinnedStatisticResult(statistic=array([1917., 2013., 1998., 2005., 2067.]) ...
If you were to pre-bin the data, then there would be no uncertainty in the counts in each bin. On the other hand, if you include the binning as part of the statistic, you might get something like:
res.confidence_interval
# ConfidenceInterval(low=array([1841., 1936., 1921., 1929., 1990.]),
# high=array([1993., 2091., 2078., 2084.05, 2147.]))
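For completeness, the count variant can be produced like this (the seed and resample count here are arbitrary, so the exact numbers will differ from those shown above):

```python
import numpy as np
from scipy.stats import bootstrap, binned_statistic

rng = np.random.default_rng(2)
x = rng.uniform(low=0, high=10, size=10000)

def statistic(x):
    # Binning happens inside the statistic, so every bootstrap
    # resample is re-binned and the per-bin counts fluctuate.
    res = binned_statistic(x, x, statistic='count',
                           bins=[0, 2, 4, 6, 8, 10])
    return res.statistic

res = bootstrap((x,), statistic, n_resamples=999)
ci = res.confidence_interval
print(ci)
```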
Upvotes: 1