NeStack

Reputation: 2012

Get error bars for bins of a dataset by using bootstrapping

I have some data called variable that I can simulate in the following way:

import pandas as pd
import numpy as np
from scipy.stats import bootstrap

random_values = np.random.uniform(low=0, high=10, size=10000)
df = pd.DataFrame({"variable": random_values})

I want to bin my data into 5 bins with the edges bins = [0, 2, 4, 6, 8, 10] and calculate for each bin error bars with some bootstrapping method, e.g. the confidence interval at the 95% level. The cumbersome part turns out to be calculating the error bars. I could do it with scipy.stats.bootstrap and then do

bootstrap(one_of_the_bins, my_statistic, confidence_level=0.95, method='percentile')

but that requires splitting my data into chunks according to the bins and looping over the chunks, roughly as in the sketch below. So I wonder: is there a more convenient way to do this, e.g. some functionality integrated in pandas? Or can I pass my full data and the bins to scipy.stats so that it does the calculations for all the bins together? Thank you for any advice!
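For reference, a rough sketch of the cumbersome per-bin loop I have in mind (pd.cut and np.mean as the statistic are just placeholders of my own choosing, not what I necessarily want to use):

bins = [0, 2, 4, 6, 8, 10]

# label each row with its bin, then loop over the per-bin chunks
df["bin"] = pd.cut(df["variable"], bins=bins)
for label, chunk in df.groupby("bin", observed=True):
    # np.mean is a stand-in for whatever per-bin statistic I need
    res = bootstrap((chunk["variable"].to_numpy(),), np.mean,
                    confidence_level=0.95, method='percentile')
    print(label, res.confidence_interval)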

Upvotes: 0

Views: 190

Answers (2)

NeStack

Reputation: 2012

I wrote the following code myself, and it does the job:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import bootstrap

data = np.random.uniform(low=0, high=10, size=10000)
bins = np.array([0, 2, 4, 6, 8, 10])

def my_statistic(data):
    # count the resampled data into the same bins as the original histogram
    y, binEdges = np.histogram(data, bins=bins)
    return y

y, binEdges = np.histogram(data, bins=bins)
bincenters = 0.5 * (binEdges[1:] + binEdges[:-1])
width = abs(binEdges[1] - binEdges[0])

# bootstrap the per-bin counts; confidence_level=0.68 corresponds roughly to 1-sigma error bars
bootstrap_err = bootstrap((data,), my_statistic, confidence_level=0.68, method='percentile')

# below we plot a density (normalized) histogram of the distribution with error bars
low, high = bootstrap_err.confidence_interval
plt.errorbar(bincenters, y / len(data) / width,
             yerr=[np.abs(y - low) / len(data) / width,
                   np.abs(high - y) / len(data) / width],
             marker='.', drawstyle='steps-mid')
plt.grid()

a plot of the calculated histogram

If you don't want a density-normalized histogram, just drop /len(data)/width wherever it appears; this gives you absolute counts and absolute errors. Note that in this case, with bootstrapping and a confidence level of 0.68, the calculated error is virtually equal to the Poisson error (yerr=np.sqrt(y)), at least in the two tests I ran; see the quick check below.
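As a rough check of that claim (my own comparison, reusing the variables from the code above), you can print the bootstrap errors next to the Poisson errors:

# compare the 68% bootstrap interval half-widths with sqrt(counts)
low, high = bootstrap_err.confidence_interval
print("bootstrap lower errors:", y - low)
print("bootstrap upper errors:", high - y)
print("Poisson errors        :", np.sqrt(y))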

Upvotes: 0

Matt Haberland

Reputation: 3803

scipy.stats.bootstrap can do the calculation for all the bins together:

import numpy as np
from scipy.stats import bootstrap, binned_statistic

x = np.random.uniform(low=0, high=10, size=10000)

def statistic(x):
    res = binned_statistic(x, x, statistic='mean',
                           bins=[0, 2, 4, 6, 8, 10])
    return res.statistic

res = bootstrap((x,), statistic)
res.confidence_interval
# ConfidenceInterval(low=array([0.99653325, 2.969743  , 4.99033544, 6.98312963, 8.9515727 ]), 
#                    high=array([1.04843922, 3.02077679, 5.04092083, 7.03426957, 9.0015762 ]))

Note the theoretical difference between

  • pre-binning the data and then computing a statistic, vs.
  • including the binning as part of the bootstrapped statistic.

The difference is easier to visualize if the statistic parameter of binned_statistic is 'count'. Suppose your observed data is binned with the following counts per bin:

binned_statistic(x, x, statistic='count', bins=[0, 2, 4, 6, 8, 10])
# BinnedStatisticResult(statistic=array([1917., 2013., 1998., 2005., 2067.]) ... 

If you were to pre-bin the data, then there would be no uncertainty in the counts in each bin. On the other hand, if you include the binning as part of the statistic, you might get something like:

res.confidence_interval
# ConfidenceInterval(low=array([1841., 1936., 1921., 1929., 1990.]), 
#                    high=array([1993., 2091., 2078., 2084.05, 2147.]))
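For completeness, a sketch of the count version of the statistic that produces intervals like these (my own variant of the function above; the exact numbers will differ from run to run):

def count_statistic(x):
    # the counts per bin are recomputed for every bootstrap resample
    res = binned_statistic(x, x, statistic='count',
                           bins=[0, 2, 4, 6, 8, 10])
    return res.statistic

res_counts = bootstrap((x,), count_statistic)
res_counts.confidence_interval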

Upvotes: 1
