Reputation: 133
I have a large dataset, with on the order of 2^15 entries, and I calculate the confidence interval of the mean of the entries with scipy.stats.bootstrap. For a dataset this size, this takes about 6 seconds on my laptop. I have a lot of datasets, so I find this takes too long (especially if I just want to do a test run to debug the plotting etc.). By default, SciPy's bootstrapping function resamples the data n_resamples=9999 times. As I understand it, the resampling and computing the average of the resampled data should be the most time-consuming part of the process. However, when I reduce the number of resamples by roughly three orders of magnitude (n_resamples=10), the computation time of the bootstrapping method does not even halve.
I'm using Python 3 and SciPy 1.9.3.
import numpy as np
from scipy import stats
from time import perf_counter
data = np.random.rand(2**15)
data = np.array([data])  # bootstrap expects a sequence of samples

start = perf_counter()
bs = stats.bootstrap(data, np.mean, batch=1, n_resamples=9999)  # default number of resamples
end = perf_counter()
print(end-start)

start = perf_counter()
bs = stats.bootstrap(data, np.mean, batch=1, n_resamples=10)  # ~1000x fewer resamples
end = perf_counter()
print(end-start)

start = perf_counter()
bs = stats.bootstrap(data, np.mean, n_resamples=10)  # default batch
end = perf_counter()
print(end-start)
gives
6.021066904067993
3.9989020824432373
30.46708607673645
To speed up the bootstrapping, I have set batch=1. As I understand it, this is more memory-efficient and prevents the data from being swapped out. Setting a higher batch number increases the time consumption, as you can see above.
How can I do faster bootstrapping?
Upvotes: 2
Views: 1794
Reputation: 3873
TL;DR: Try passing method='basic' or method='percentile'. Also, pass a NumPy Generator for the random state (e.g. random_state=np.random.default_rng()).
import numpy as np
from scipy import stats
from time import perf_counter
rng = np.random.default_rng(234893458942534)
data = np.random.rand(2**15)
data = np.array([data])
start = perf_counter()
bs = stats.bootstrap(data, np.mean, batch=1, n_resamples=10)
end = perf_counter()
print(end-start) # 13.304466278000064
start = perf_counter()
bs = stats.bootstrap(data, np.mean, batch=None, n_resamples=10,
                     method='basic', random_state=rng)
end = perf_counter()
print(end-start) # 0.0040362729999969815
This should free up enough time to do more resamples.
The default is method='BCa'. This is the bias-corrected and accelerated bootstrap, which performs a jackknife to estimate the acceleration parameter. The jackknife evaluates the statistic on N jackknife resamples, where N is the number of observations in the sample; each resample is the original sample with one observation removed. This explains why reducing the number of bootstrap resamples does not reduce the execution time much: it is dominated by the jackknife. The other methods don't need the jackknife.
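A minimal sketch (assuming SciPy >= 1.9) that isolates the method cost: both calls draw only 10 bootstrap resamples, so whatever time remains for BCa is the N-sample jackknife that the other methods skip.
import numpy as np
from scipy import stats
from time import perf_counter

rng = np.random.default_rng(0)
data = (rng.random(2**15),)  # one sample of 2**15 observations

for method in ('BCa', 'percentile'):
    start = perf_counter()
    stats.bootstrap(data, np.mean, n_resamples=10, batch=1,
                    method=method, random_state=rng)
    print(f"{method}: {perf_counter() - start:.3f} s")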
Passing the Generator for the random state boosts the speed a bit because it is faster at generating the resamples than the old RandomState (the default).
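A rough micro-benchmark of the index generation alone, not the full bootstrap; the exact speedup varies by machine and NumPy version, and the sizes below are just assumptions to keep memory modest.
import numpy as np
from time import perf_counter

n, n_resamples = 2**15, 999

start = perf_counter()
np.random.RandomState(0).randint(0, n, size=(n_resamples, n))
print(f"RandomState.randint: {perf_counter() - start:.2f} s")

start = perf_counter()
np.random.default_rng(0).integers(0, n, size=(n_resamples, n))
print(f"Generator.integers:  {perf_counter() - start:.2f} s")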
Setting a higher batch number increases the time consumption, as you can see above.
When the sample size is large, sure, but not in general. For smaller samples with a vectorized statistic function like np.mean, leaving batch=None or using a moderate value (e.g. batch=100) tends to be much faster than batch=1 because less Python looping is required, as the sketch below shows.
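A sketch with an assumed small sample (n = 100): batch=1 forces 9999 Python-level iterations, while batch=None draws all resamples in one vectorized pass.
import numpy as np
from scipy import stats
from time import perf_counter

rng = np.random.default_rng(0)
small = (rng.random(100),)

for batch in (1, 100, None):
    start = perf_counter()
    stats.bootstrap(small, np.mean, n_resamples=9999, method='percentile',
                    batch=batch, random_state=rng)
    print(f"batch={batch}: {perf_counter() - start:.3f} s")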
Upvotes: 2
Reputation: 2793
I had the same problem for sample sizes > 20k: first memory problems, and with batch=1 it did not even return for 25k samples. I ended up rewriting the resampling part of bootstrap() and copying some code out of it:
import numpy as np
from scipy import stats


def bootstrap_test_fast(y_true, y_pred, statistic, n_resamples=1000):
    # The original answer used its own cal_recall as the default statistic;
    # that function is not shown here, so the statistic must be passed in.
    rng = np.random.default_rng()
    confidence_level = 0.95
    alternative = 'two-sided'

    # Resample (y_true, y_pred) pairs jointly.
    data = np.stack((y_true, y_pred))
    ndata = len(y_true)
    theta_hat_b = []
    for _ in range(n_resamples):
        resampled = rng.choice(a=data, size=ndata, replace=True, axis=1)
        theta_hat_b.append(statistic(resampled[0], resampled[1]))

    alpha = ((1 - confidence_level) / 2 if alternative == 'two-sided'
             else (1 - confidence_level))
    interval = alpha, 1 - alpha

    def percentile_fun(a, q):
        return np.percentile(a=a, q=q, axis=-1)

    # 'basic' (reverse percentile) confidence interval.
    ci_l = percentile_fun(theta_hat_b, interval[0] * 100)
    ci_u = percentile_fun(theta_hat_b, interval[1] * 100)
    theta_hat = statistic(np.array(y_true), np.array(y_pred))
    ci_l, ci_u = 2 * theta_hat - ci_u, 2 * theta_hat - ci_l

    return stats._resampling.BootstrapResult(
        confidence_interval=stats._resampling.ConfidenceInterval(ci_l, ci_u),
        bootstrap_distribution=theta_hat_b,
        standard_error=np.std(theta_hat_b, ddof=1, axis=-1))
The code is much faster: a dataset of 200k samples with n_resamples=1000 takes about 15 seconds.
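A hypothetical usage sketch: the statistic here is a simple accuracy function and the data are random stand-in labels, since the answer's own cal_recall is not shown.
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200_000)
y_pred = np.where(rng.random(200_000) < 0.9, y_true, 1 - y_true)  # ~90% agreement

def accuracy(y_t, y_p):
    return np.mean(np.asarray(y_t) == np.asarray(y_p))

res = bootstrap_test_fast(y_true, y_pred, statistic=accuracy, n_resamples=1000)
print(res.confidence_interval)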
Upvotes: -1