Georg

Reputation: 133

Time consumption of SciPy's bootstrap as a function of the number of resamples

I have a large dataset, with on the order of 2^15 entries, and I calculate the confidence interval of the mean of the entries with scipy.stats.bootstrap. For a dataset of this size, this takes about 6 seconds on my laptop. I have a lot of datasets, so I find this takes too long (especially if I just want to do a test run to debug the plotting etc.). By default, SciPy's bootstrapping function resamples the data n_resamples=9999 times. As I understand it, resampling the data and computing the average of each resample should be the most time-consuming part of the process. However, when I reduce the number of resamples by roughly three orders of magnitude (n_resamples=10), the run time of the bootstrapping method does not even halve.

I'm using Python 3 and SciPy 1.9.3.

import numpy as np
from scipy import stats
from time import perf_counter

data = np.random.rand(2**15)
data = np.array([data])

start = perf_counter()
bs = stats.bootstrap(data, np.mean, batch=1, n_resamples=9999)
end = perf_counter()
print(end-start)

start = perf_counter()
bs = stats.bootstrap(data, np.mean, batch=1, n_resamples=10)
end = perf_counter()
print(end-start)

start = perf_counter()
bs = stats.bootstrap(data, np.mean, n_resamples=10)
end = perf_counter()
print(end-start)

gives

6.021066904067993
3.9989020824432373
30.46708607673645

To speed up bootstrapping, I have set batch=1. As I understand it, this is more memory efficient and prevents the data from being swapped to disk. Using a larger batch (or leaving batch=None) increases the time consumption, as you can see above.

How can I do faster bootstrapping?

Upvotes: 2

Views: 1794

Answers (2)

Matt Haberland

Reputation: 3873

TLDR: Try passing method='basic' or method='percentile'. Also, pass a NumPy Generator for the random state (e.g. random_state=np.random.default_rng()).

import numpy as np
from scipy import stats
from time import perf_counter
rng = np.random.default_rng(234893458942534)

data = np.random.rand(2**15)
data = np.array([data])

start = perf_counter()
bs = stats.bootstrap(data, np.mean, batch=1, n_resamples=10)
end = perf_counter()
print(end-start)  # 13.304466278000064

start = perf_counter()
bs = stats.bootstrap(data, np.mean, batch=None, n_resamples=10, 
                     method='basic', random_state=rng)
end = perf_counter()
print(end-start)  # 0.0040362729999969815

This should free up enough time to do more resamples.


The default is method='BCa'. This is the bias-corrected and accelerated bootstrap, which performs a jackknife to estimate the acceleration parameter. The jackknife evaluates the statistic on N jackknife resamples, where N is the number of observations in the sample; each resample is your original sample with one observation removed. This explains why reducing the number of bootstrap resamples does not reduce the execution time much: the run time is dominated by the jackknife. The other methods don't need to do the jackknife.
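
To see where the time goes, here is a minimal leave-one-out sketch (not SciPy's implementation, which vectorizes the jackknife in batches, but the evaluation count is the same): with N = 2**15 observations the jackknife alone needs 32768 statistic evaluations, on top of the n_resamples bootstrap evaluations, so cutting n_resamples from 9999 to 10 only removes the smaller term.

import numpy as np

def jackknife_sketch(sample, statistic=np.mean):
    # Leave-one-out jackknife: evaluate the statistic once per observation.
    # For the data in the question this is 2**15 = 32768 evaluations,
    # independent of n_resamples.
    n = len(sample)
    return np.array([statistic(np.delete(sample, i)) for i in range(n)])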

Passing the Generator for the random state speeds things up a bit because it is faster at generating the resamples than the legacy RandomState (the default).

"Setting a higher batch number increases the time consumption, as you can see above."

When the sample size is large, sure, but not in general. For smaller samples with a vectorized statistic function like np.mean, leaving batch=None or using a moderate value (e.g. batch=100) tends to be much faster than batch=1 because less Python looping is required.
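
As an illustration (timings will vary, and this assumes a smaller sample than in the question so that batch=None fits comfortably in memory), something like the following makes the difference visible:

import numpy as np
from scipy import stats
from time import perf_counter

rng = np.random.default_rng(0)
small = (rng.random(1000),)  # smaller sample than the question's 2**15

for batch in (None, 100, 1):
    start = perf_counter()
    stats.bootstrap(small, np.mean, batch=batch, n_resamples=9999,
                    method='percentile', random_state=rng)
    # batch=None evaluates all resamples in one vectorized call;
    # batch=1 falls back to a Python-level loop with 9999 iterations.
    print(f"batch={batch}: {perf_counter() - start:.3f} s")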

Upvotes: 2

Stiefel

Reputation: 2793

I had the same problem for sample sizes > 20k: first memory problems, and with batch=1 it did not even return for 25k samples.

I ended up rewriting the resampling part of bootstrap() and copying some code out of it:

import numpy as np
from scipy import stats

def bootstrap_test_fast(y_true, y_pred, statistic=cal_recall, n_resamples=1000):
    # statistic: any function of (y_true, y_pred); cal_recall is my own metric.
    rng = np.random.default_rng()
    confidence_level = 0.95
    alternative = 'two-sided'
    data = np.stack((y_true, y_pred))

    # Resample (y_true, y_pred) pairs with replacement and evaluate the statistic.
    ndata = len(y_true)
    theta_hat_b = []
    for n in range(n_resamples):
        resampled = rng.choice(a=data, size=ndata, replace=True, axis=1)
        theta_hat_b.append(statistic(resampled[0], resampled[1]))

    alpha = ((1 - confidence_level) / 2 if alternative == 'two-sided' else (1 - confidence_level))
    interval = alpha, 1 - alpha

    def percentile_fun(a, q):
        return np.percentile(a=a, q=q, axis=-1)

    # 'basic' (reverse percentile) confidence interval, as in scipy's bootstrap
    ci_l = percentile_fun(theta_hat_b, interval[0] * 100)
    ci_u = percentile_fun(theta_hat_b, interval[1] * 100)
    theta_hat = statistic(np.array(y_true), np.array(y_pred))
    ci_l, ci_u = 2 * theta_hat - ci_u, 2 * theta_hat - ci_l

    # Reuse scipy's (private) result types so the return value looks familiar.
    return stats._resampling.BootstrapResult(
        confidence_interval=stats._resampling.ConfidenceInterval(ci_l, ci_u),
        bootstrap_distribution=theta_hat_b,
        standard_error=np.std(theta_hat_b, ddof=1, axis=-1))

The code is much faster: bootstrapping a sample of 200k observations with n_resamples=1000 takes about 15 seconds.
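
For reference, a minimal usage sketch; cal_recall is not shown above, so the stand-in below is hypothetical and only there to make the snippet self-contained:

import numpy as np

def cal_recall(y_true, y_pred):
    # Hypothetical stand-in: fraction of true positives that were recovered.
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    return (y_true & y_pred).sum() / max(y_true.sum(), 1)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200_000)
y_pred = rng.integers(0, 2, size=200_000)

res = bootstrap_test_fast(y_true, y_pred, statistic=cal_recall, n_resamples=1000)
print(res.confidence_interval)  # takes roughly 15 seconds, as noted above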

Upvotes: -1
