Selecting random values from dataframe without replacement

Question

I am following the answer from the link:

If I have a dataframe df as:

Month   Day   mnthShape
1       1     1.01
1       1     1.09
1       1     0.96
1       2     1.01
1       1     1.09
1       2     0.96
1       3     1.01
1       3     1.09
1       3     1.78

I want to get the following from df:

Month   Day mnthShape
1       1   1.01
1       2   1.01
1       1   0.96

where the mnthShape values are selected at random, without replacement, from the rows matching each index. That is, if the query is df.loc[(1, 1)], it should look up all values for (1, 1) and randomly select one of them to display above. If another df.loc[(1, 1)] appears, it should select randomly again, but without replacement.

I know I need to modify the code to use the following:

apply(np.random.choice, replace=False)

But not sure how to do it.

Edit: Every time I do df.loc[(1, 1)], it should give a new value without replacement. I intend to do df.loc[(1, 1)] multiple times. In the previous question, it was done just once.

Michael Delgado · Accepted Answer

If you're trying to sample from the dataset without replacement, it probably makes sense to do this all in one go, rather than iteratively pulling a sample from the dataset.

Pulling N samples from each month/day combination requires that every combination have at least N rows; otherwise sampling without replacement fails. Assuming that holds, you could write a function to sample N values from a subset of the data:

import numpy as np

def select_n(subset, n=2):
    # Pick n distinct row positions within this Month/Day group
    choices = np.random.choice(len(subset), size=n, replace=False)
    return (
        subset
        .mnthShape
        .iloc[choices]
        .reset_index(drop=True)
        .rename_axis('choice'))

To apply this across the whole dataset:

In [34]: df.groupby(['Month', 'Day']).apply(select_n)
Out[34]:
choice        0     1
Month Day
1     1    1.09  0.96
      2    0.96  1.01
      3    1.09  1.01
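
The group-size requirement mentioned above can be checked before sampling. This is a minimal sketch, assuming pandas 1.1 or later (where the groupby `sample` method is available) and the example data from the question. The names `n`, `group_sizes`, and `too_small` are illustrative and not part of the original answer:

```python
import pandas as pd

# Example data from the question
df = pd.DataFrame({
    "Month": [1] * 9,
    "Day": [1, 1, 1, 2, 1, 2, 3, 3, 3],
    "mnthShape": [1.01, 1.09, 0.96, 1.01, 1.09, 0.96, 1.01, 1.09, 1.78],
})

n = 2  # number of values to draw per (Month, Day) group

# Sampling without replacement needs at least n rows in every group
group_sizes = df.groupby(["Month", "Day"]).size()
too_small = group_sizes[group_sizes < n]
if not too_small.empty:
    raise ValueError(f"groups with fewer than {n} rows: {list(too_small.index)}")

# Draw n distinct rows from each group; the original row labels are kept
sampled = df.groupby(["Month", "Day"]).sample(n=n, replace=False)
print(sampled)
```

Unlike select_n, this returns the sampled rows in long form (one row per draw) rather than one column per choice.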

If you really need to pull these one at a time, you'll still need to generate the samples all at once to guarantee that they're drawn without replacement, but you could generate the sample indices separately from subsetting the data:

In [48]: indices = np.random.choice(2, size=2, replace=False)  # the (1, 2) group has only 2 rows

In [49]: df[((df.Month == 1) & (df.Day == 2))].iloc[indices[0]]
Out[49]:
Month        1.00
Day          2.00
mnthShape    1.01
Name: 3, dtype: float64

In [50]: df[((df.Month == 1) & (df.Day == 2))].iloc[indices[1]]
Out[50]:
Month        1.00
Day          2.00
mnthShape    0.96
Name: 5, dtype: float64
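
If the draws really do have to happen one call at a time, one way to keep them without replacement is to shuffle each group once up front and then hand out values from the shuffled pool. The sketch below assumes the example data from the question; `GroupSampler` and its `draw` method are hypothetical names introduced here for illustration, not a pandas API:

```python
import numpy as np
import pandas as pd

# Example data from the question
df = pd.DataFrame({
    "Month": [1] * 9,
    "Day": [1, 1, 1, 2, 1, 2, 3, 3, 3],
    "mnthShape": [1.01, 1.09, 0.96, 1.01, 1.09, 0.96, 1.01, 1.09, 1.78],
})


class GroupSampler:
    """Hypothetical helper: returns each group's values one at a time, never reusing a row."""

    def __init__(self, df, keys, value, seed=None):
        rng = np.random.default_rng(seed)
        # Shuffle every group's values once; popping from the shuffled
        # list afterwards is equivalent to drawing without replacement.
        self._pools = {
            key: list(rng.permutation(group[value].to_numpy()))
            for key, group in df.groupby(keys)
        }

    def draw(self, key):
        pool = self._pools[key]
        if not pool:
            raise ValueError(f"no values left for group {key}")
        return pool.pop()


sampler = GroupSampler(df, ["Month", "Day"], "mnthShape")
print(sampler.draw((1, 1)))  # one random value from the (1, 1) rows
print(sampler.draw((1, 2)))  # one random value from the (1, 2) rows
print(sampler.draw((1, 1)))  # a different (1, 1) row than the first draw
```

Each group can be drawn from only as many times as it has rows; after that, draw raises a ValueError rather than repeating a value. Note that a value such as 1.09 can still come up twice for (1, 1), because it appears in two separate rows.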
