TheRealSanity

Reputation: 35

How to repeat a certain command (BOOTSTRAP RESAMPLING) with Python

I have a DataFrame with 4 rows and want to bootstrap it X times.

DATA FRAME EXAMPLE:

              Index A B
                0   1 2
                1   1 2
                2   1 2
                3   1 2 

I worked out this code for the bootstrap resampling:

      # assuming resample is sklearn.utils.resample
      from sklearn.utils import resample

      boot = resample(df, replace=True, n_samples=len(df), random_state=1)
      print('Bootstrap Sample: %s' % boot)

Now I would like to repeat this X times. How can I do this?

Desired output for X = 20:

  Sample Nr.    Index A B
      1         0   1 2
                1   1 2
                2   1 2
                3   1 2 
     ...
      20        0   1 2
                1   1 2
                1   1 2
                2   1 2   

Thank you guys.

Best

Upvotes: 1

Views: 1065

Answers (2)

GSA

Reputation: 793

I just want to add another approach that uses numpy.random.Generator.choice. This approach works whether your data is a NumPy array or a pandas DataFrame.

Using the sample data you provided:

import numpy as np
import pandas as pd

df = pd.DataFrame({'index': [0, 1, 2, 3],
                   'A': [1, 1, 1, 1],
                   'B': [2, 2, 2, 2]})
df

Here is how I would do it with the NumPy approach:

rng = np.random.default_rng()

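def simple_bootstrap(data, replace=True, replicates=5, random_state=None, shuffle=True):
    # Use a seeded generator when random_state is given, else the module-level rng
    gen = rng if random_state is None else np.random.default_rng(random_state)
    def simple_resample(data, size=len(data), replace=replace, shuffle=shuffle, axis=0):
        # Draw `size` whole rows along `axis`; the result is a NumPy array
        return gen.choice(a=data, size=size, replace=replace, shuffle=shuffle, axis=axis)
    return [simple_resample(data) for _ in range(replicates)]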

When I call the function on my df as below, it returns 5 replicates, each containing 4 rows drawn at random from my data:

simple_bootstrap(df)

[array([[1, 1, 2],
        [2, 1, 2],
        [0, 1, 2],
        [3, 1, 2]], dtype=int64),
 array([[0, 1, 2],
        [1, 1, 2],
        [1, 1, 2],
        [3, 1, 2]], dtype=int64),
 array([[3, 1, 2],
        [1, 1, 2],
        [1, 1, 2],
        [2, 1, 2]], dtype=int64),
 array([[3, 1, 2],
        [1, 1, 2],
        [3, 1, 2],
        [3, 1, 2]], dtype=int64),
 array([[0, 1, 2],
        [3, 1, 2],
        [3, 1, 2],
        [3, 1, 2]], dtype=int64)]

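Note that replicates=5 controls the number of arrays in the list, while each array has 4 rows because size defaults to len(data). Whole rows are drawn because axis=0: as the NumPy documentation puts it, if a has more than one dimension, the size shape will be inserted into the axis dimension, so the output ndim will be a.ndim - 1 + len(size).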
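To make the shape rule concrete, here is a minimal sketch using a plain NumPy array with the same values as the df above. The seed 0 and the variable names are my own choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only so the sketch is repeatable

data = np.array([[0, 1, 2],
                 [1, 1, 2],
                 [2, 1, 2],
                 [3, 1, 2]])

# One bootstrap replicate: 4 whole rows drawn with replacement along axis 0.
replicate = rng.choice(data, size=len(data), replace=True, axis=0)
print(replicate.shape)  # (4, 3)

# A tuple size is inserted into the sampled axis: 5 replicates of 4 rows each.
stacked = rng.choice(data, size=(5, len(data)), replace=True, axis=0)
print(stacked.shape)  # (5, 4, 3)
```

The second call is an alternative to the list comprehension when a single 3-D array is more convenient than a list of 2-D arrays.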

You could also extend the bootstrap function to take a statistical function that is applied to each replicate, collecting the results in a list, as in the example below:

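def simple_bootstrap(data, statfunction, replace=True, replicates=5, random_state=None, shuffle=True):
    # Use a seeded generator when random_state is given, else the module-level rng
    gen = rng if random_state is None else np.random.default_rng(random_state)
    def simple_resample(data, size=len(data), replace=replace, shuffle=shuffle, axis=0):
        # Draw `size` whole rows along `axis`; the result is a NumPy array
        return gen.choice(a=data, size=size, replace=replace, shuffle=shuffle, axis=axis)
    # Apply the statistic to every replicate and collect the estimates
    resample_estimates = [statfunction(simple_resample(data)) for _ in range(replicates)]
    return resample_estimates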
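As a usage illustration, here is a self-contained sketch that computes the column means of each replicate. The helper name bootstrap_statistic and its seed parameter are hypothetical stand-ins of mine, not the function above, and the lambda is just one possible statfunction:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'index': [0, 1, 2, 3],
                   'A': [1, 1, 1, 1],
                   'B': [2, 2, 2, 2]})

def bootstrap_statistic(data, statfunction, replicates=5, seed=None):
    """Hypothetical helper: apply statfunction to each bootstrap replicate."""
    rng = np.random.default_rng(seed)
    values = np.asarray(data)  # rows of the DataFrame as a 2-D array
    return [statfunction(rng.choice(values, size=len(values), replace=True, axis=0))
            for _ in range(replicates)]

# Column means of each replicate; axis=0 averages over the resampled rows.
means = bootstrap_statistic(df, lambda sample: sample.mean(axis=0),
                            replicates=5, seed=0)
for m in means:
    print(m)
```

Because columns A and B are constant in this data, only the mean of the index column varies between replicates.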

Upvotes: 0

Miguel Trejo

Reputation: 6667

Approach 1: Sample Data in Parallel

Because calling the sample method of a DataFrame n times can be time-consuming, one can consider running the calls in parallel.

import multiprocessing
from itertools import repeat

def sample_data(df, replace, random_state):
    '''Generate one sample of size len(df)'''
    return df.sample(replace=replace, n=len(df), random_state=random_state)

def resample_data(df, replace, n_samples, random_state):
    '''Call sample_data n_samples times in parallel'''

    # repeat(df, n_samples) bounds the zip, so exactly n_samples tasks are created
    args = zip(repeat(df, n_samples), repeat(replace), repeat(random_state))

    # One worker per CPU core; the with-block cleans up the pool afterwards
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        bootstrap_samples = pool.starmap(sample_data, args)

    return bootstrap_samples

Now, if I want to generate 15 samples, resample_data returns a list with 15 samples from the df.

samples = resample_data(df, True, n_samples=15, random_state=1)

Note that with a fixed integer random_state every call to sample_data uses the same seed, so all samples are identical. To get different samples, set random_state to None.
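If the samples should differ from one another but still be reproducible, one option is to derive a separate integer seed for each sample from a single base seed. A minimal sketch, assuming NumPy's SeedSequence; the helper name resample_with_distinct_seeds and the base_seed parameter are hypothetical, and the loop runs sequentially to keep the example short:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1], 'B': [2, 2, 2, 2]})

def resample_with_distinct_seeds(df, n_samples, base_seed):
    """Hypothetical helper: one reproducible, distinct seed per sample."""
    # generate_state returns n_samples 32-bit integers derived from base_seed
    seeds = np.random.SeedSequence(base_seed).generate_state(n_samples)
    return [df.sample(n=len(df), replace=True, random_state=int(seed))
            for seed in seeds]

samples = resample_with_distinct_seeds(df, n_samples=15, base_seed=1)
print(len(samples))  # 15
```

The same list of seeds could be zipped into the starmap arguments in place of repeat(random_state).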

Approach 2: Sample Data Linearly

Another approach is a list comprehension: since the function sample_data is already defined, it is straightforward to call it once per sample inside the comprehension.

def resample_data_linearly(df, replace, n_samples, random_state):
    
    return [sample_data(df, replace, random_state) for _ in range(n_samples)] 

# Generate 10 samples of size len(df)
samples = resample_data_linearly(df, True, n_samples=10, random_state=1)
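To get a layout close to the one sketched in the question (a sample number next to the original row index), the list of samples can be combined with pd.concat. This is a minimal sketch under my own assumptions: the level names 'Sample Nr.' and 'Index' are taken from the question's table, and each sample uses its loop counter as seed so the samples differ:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1], 'B': [2, 2, 2, 2]})

n_samples = 20
# One bootstrap sample per iteration, each with its own seed
samples = [df.sample(n=len(df), replace=True, random_state=i)
           for i in range(n_samples)]

# keys labels the samples 1..20; names titles the two index levels
combined = pd.concat(samples,
                     keys=list(range(1, n_samples + 1)),
                     names=['Sample Nr.', 'Index'])

print(combined.loc[1])  # the 4 rows of the first sample
```

Selecting combined.loc[k] returns the k-th sample, and the inner index level shows which original rows were drawn.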

Upvotes: 2
