Reputation: 35
I have a DataFrame with 4 rows and want to bootstrap it X times.
DATA FRAME EXAMPLE:
Index A B
0 1 2
1 1 2
2 1 2
3 1 2
I figured out this code for the bootstrap resampling:
from sklearn.utils import resample  # assuming resample is sklearn.utils.resample
boot = resample(df, replace=True, n_samples=len(df), random_state=1)
print('Bootstrap Sample: %s' % boot)
But now I would like to repeat this X times. How can I do this?
Desired output for X=20:
Sample Nr. Index A B
1 0 1 2
1 1 2
2 1 2
3 1 2
...
20 0 1 2
1 1 2
1 1 2
2 1 2
Thank you guys.
Best
Upvotes: 1
Views: 1065
Reputation: 793
Just want to add another approach that uses numpy.random.Generator.choice. This approach will work whether your data is a numpy array or pandas dataframe.
Using the sample data you provided:
import numpy as np
import pandas as pd

df = pd.DataFrame({'index': [0, 1, 2, 3],
                   'A': [1, 1, 1, 1],
                   'B': [2, 2, 2, 2]})
df
Here is how I would do it with the NumPy approach:
rng = np.random.default_rng()
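def simple_bootstrap(data, replace=True, replicates=5, random_state=None, shuffle=True):
    # Seed a generator from random_state so that results can be reproduced
    rng = np.random.default_rng(random_state)

    def simple_resample(data, size=len(data), replace=replace, shuffle=shuffle, axis=0):
        # Draw `size` whole rows along `axis`
        return rng.choice(a=data, size=size, replace=replace, shuffle=shuffle, axis=axis)

    return [simple_resample(data) for _ in range(replicates)]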
When I call the function on my df like below, it gives me 5 replicates, each containing 4 rows drawn at random from my data:
simple_bootstrap(df)
[array([[1, 1, 2],
[2, 1, 2],
[0, 1, 2],
[3, 1, 2]], dtype=int64),
array([[0, 1, 2],
[1, 1, 2],
[1, 1, 2],
[3, 1, 2]], dtype=int64),
array([[3, 1, 2],
[1, 1, 2],
[1, 1, 2],
[2, 1, 2]], dtype=int64),
array([[3, 1, 2],
[1, 1, 2],
[3, 1, 2],
[3, 1, 2]], dtype=int64),
array([[0, 1, 2],
[3, 1, 2],
[3, 1, 2],
[3, 1, 2]], dtype=int64)]
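Remember, although I asked for replicates=5, each replicate contains 4 rows: replicates sets the number of resamples, while size=len(data) sets the number of rows in each one. Each result keeps all columns because, if a has more than one dimension, the size shape is inserted into the axis dimension, so the output ndim is a.ndim - 1 + len(size).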
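The shape rule quoted above can be checked directly. The following is a minimal sketch; the seed and the array contents are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

data = np.array([[0, 1, 2],
                 [1, 1, 2],
                 [2, 1, 2],
                 [3, 1, 2]])  # shape (4, 3), ndim 2

# An integer size draws that many whole rows along axis 0;
# the column dimension is left untouched.
one_resample = rng.choice(data, size=4, axis=0)
print(one_resample.shape)  # (4, 3)

# A tuple size replaces axis 0 with both of its dimensions:
# ndim = a.ndim - 1 + len(size) = 2 - 1 + 2 = 3
stacked = rng.choice(data, size=(5, 4), axis=0)
print(stacked.shape)  # (5, 4, 3)
```

The second call suggests an alternative to the list comprehension: all replicates can be drawn in one call and indexed as stacked[i].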
You could also extend your bootstrap function to include a statistical function that runs over each replication and saves it into a list, like the example below:
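def simple_bootstrap(data, statfunction, replace=True, replicates=5, random_state=None, shuffle=True):
    # Seed a generator from random_state so that results can be reproduced
    rng = np.random.default_rng(random_state)

    def simple_resample(data, size=len(data), replace=replace, shuffle=shuffle, axis=0):
        return rng.choice(a=data, size=size, replace=replace, shuffle=shuffle, axis=axis)

    # Apply the statistic to each resample and collect the estimates
    resample_estimates = [statfunction(simple_resample(data)) for _ in range(replicates)]
    return resample_estimates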
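As a usage illustration, here is a self-contained sketch of the same idea with the column mean as the statistic. The name bootstrap_estimates and the lambda are hypothetical, introduced only for this example, and the seed is arbitrary.

```python
import numpy as np
import pandas as pd

def bootstrap_estimates(data, statfunction, replicates=5, random_state=None):
    """Apply statfunction to `replicates` bootstrap resamples of the rows of data."""
    rng = np.random.default_rng(random_state)
    values = np.asarray(data)  # works for both DataFrames and arrays
    n = len(values)
    return [statfunction(rng.choice(values, size=n, replace=True, axis=0))
            for _ in range(replicates)]

df = pd.DataFrame({'A': [1, 1, 1, 1], 'B': [2, 2, 2, 2]})

# One vector of column means per resample
col_means = bootstrap_estimates(df, lambda x: x.mean(axis=0),
                                replicates=20, random_state=1)
print(len(col_means))  # 20
print(col_means[0])    # every column is constant here, so the means are 1 and 2
```

With real data the estimates vary between resamples, and their spread (for example via np.percentile) gives a rough idea of the statistic's uncertainty.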
Upvotes: 0
Reputation: 6667
Since calling the DataFrame's sample method n times can be time-consuming, one can consider applying the sample method in parallel.
import multiprocessing
from itertools import repeat
def sample_data(df, replace, random_state):
    '''Generate one sample of size len(df)'''
    return df.sample(replace=replace, n=len(df), random_state=random_state)

def resample_data(df, replace, n_samples, random_state):
    '''Call the sample method n_samples times in parallel'''
    # Run sample_data in a pool with one worker process per CPU core
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    bootstrap_samples = pool.starmap(sample_data, zip(repeat(df, n_samples), repeat(replace), repeat(random_state)))
    pool.close()
    pool.join()
    return bootstrap_samples
Now, if I want to generate 15 samples, resample_data will return a list with 15 samples from the df.
samples = resample_data(df, True, n_samples=15, random_state=1)
Note that with a fixed random_state every call draws with the same seed, so all samples are identical. To get different samples, set random_state to None.
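If the samples should differ from each other and still be reproducible, one option is to give each sample its own seed instead of None. This is a sketch, not part of the code above: make_seeds is a hypothetical helper, and the base seed and example data are arbitrary. Explicit seeds may also be safer in the pooled version, since worker processes can start from the same global NumPy state depending on how they are launched.

```python
import numpy as np
import pandas as pd

def make_seeds(base_seed, n_samples):
    """Derive n_samples integer seeds from one base seed (hypothetical helper)."""
    state = np.random.SeedSequence(base_seed).generate_state(n_samples)
    return [int(s) for s in state]  # plain ints in the uint32 range

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})

seeds = make_seeds(base_seed=1, n_samples=15)

# One bootstrap sample per seed; the same base seed reproduces the same list
samples = [df.sample(n=len(df), replace=True, random_state=seed) for seed in seeds]
print(len(samples))  # 15
```

The same list of seeds could be passed to pool.starmap in place of repeat(random_state).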
Another approach is a list comprehension: since the function sample_data is already defined, it is straightforward to call it inside the list.
def resample_data_linearly(df, replace, n_samples, random_state):
    return [sample_data(df, replace, random_state) for _ in range(n_samples)]

# Generate 10 samples of size len(df)
samples = resample_data_linearly(df, True, n_samples=10, random_state=1)
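To get the layout asked for in the question (a sample number next to the original index), the list of samples can be combined with pd.concat. This is a minimal sketch. Using the loop counter as the seed is an assumption made here so that the samples differ, and the level names are taken from the question's example output.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1], 'B': [2, 2, 2, 2]})
n_samples = 20

# One bootstrap sample per seed (seeds 0..19 are an arbitrary choice)
samples = [df.sample(n=len(df), replace=True, random_state=i)
           for i in range(n_samples)]

# keys adds an outer index level numbering the samples from 1 to n_samples
combined = pd.concat(samples,
                     keys=list(range(1, n_samples + 1)),
                     names=['Sample Nr.', 'Index'])

print(combined.head(8))
print(combined.loc[1])  # rows of the first sample only
```

A single sample can then be selected with combined.loc[k], and groupby(level='Sample Nr.') applies a statistic to each sample.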
Upvotes: 2