rwolst

Sampling from within Pandas groups with defined probabilities

Consider the following Pandas dataframe,

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
         ['X', 0, 0.5],
         ['X', 1, 0.5],

         ['Y', 0, 0.25],
         ['Y', 1, 0.3],
         ['Y', 2, 0.45],

         ['Z', 0, 0.6],
         ['Z', 1, 0.1],
         ['Z', 2, 0.3]
    ], columns=['NAME', 'POSITION', 'PROB'])

Notice that df defines a discrete probability distribution for each unique NAME value, i.e.

assert ((df.groupby('NAME')['PROB'].sum() - 1)**2 < 1e-10).all()

What I would like to do is sample from these probability distributions.

We can think of POSITION as being the values corresponding to the probabilities. So when considering X the sample will be 0 with probability 0.5 and 1 with probability 0.5.
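For a single NAME this is just a weighted draw; a minimal sketch with NumPy (the fixed seed is my addition, for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator, so the sketch is repeatable
# For X, draw one POSITION: 0 or 1, each with probability 0.5.
sample = rng.choice([0, 1], p=[0.5, 0.5])
assert sample in (0, 1)
```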

I would like to create a new dataframe with columns ['NAME', 'POSITION', 'PROB', 'SAMPLE'] representing these samples. Each unique SAMPLE value represents a new sample. The PROB column is now always 0 or 1, representing whether the given row was selected in the given sample. For example, if I were to draw 3 samples, one possible outcome would be:

df_samples = pd.DataFrame(
    [
         ['X', 0, 1, 0],
         ['X', 1, 0, 0],
         ['X', 0, 0, 1],
         ['X', 1, 1, 1],
         ['X', 0, 1, 2],
         ['X', 1, 0, 2],

         ['Y', 0, 1, 0],
         ['Y', 1, 0, 0],
         ['Y', 2, 0, 0],
         ['Y', 0, 0, 1],
         ['Y', 1, 0, 1],
         ['Y', 2, 1, 1],
         ['Y', 0, 1, 2],
         ['Y', 1, 0, 2],
         ['Y', 2, 0, 2],

         ['Z', 0, 0, 0],
         ['Z', 1, 0, 0],
         ['Z', 2, 1, 0],
         ['Z', 0, 0, 1],
         ['Z', 1, 0, 1],
         ['Z', 2, 1, 1],
         ['Z', 0, 1, 2],
         ['Z', 1, 0, 2],
         ['Z', 2, 0, 2],
    ], columns=['NAME', 'POSITION', 'PROB', 'SAMPLE'])

Of course due to the randomness involved, this is just one of a number of possible outcomes.

A unit test for the program would be that, as the number of samples increases, the mean of our samples for each (NAME, POSITION) pair should tend to the true probability, by the law of large numbers. One could compute a confidence region based on the total number of samples and check that the true probability lies within it. For example, using a normal approximation to the binomial outcomes (which requires the total number of samples n_samples to be large), a (-4 sd, +4 sd) region test would be:

z = 4

p_est = df_samples.groupby(['NAME', 'POSITION'])['PROB'].mean()
p_true = df.set_index(['NAME', 'POSITION'])['PROB']

CI_lower = p_est - z*np.sqrt(p_est*(1-p_est)/n_samples)
CI_upper = p_est + z*np.sqrt(p_est*(1-p_est)/n_samples)

assert (p_true < CI_upper).all()
assert (p_true > CI_lower).all()
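Putting the test together end to end: the sketch below pairs a simple per-group np.random.choice sampler (not necessarily the final implementation) with the confidence-interval check, using a seed of my choosing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['X', 0, 0.5], ['X', 1, 0.5],
     ['Y', 0, 0.25], ['Y', 1, 0.3], ['Y', 2, 0.45],
     ['Z', 0, 0.6], ['Z', 1, 0.1], ['Z', 2, 0.3]],
    columns=['NAME', 'POSITION', 'PROB'])

rng = np.random.default_rng(0)
n_samples = 10_000

# Draw n_samples positions per NAME, then expand to 0/1 indicator rows.
rows = []
for name, grp in df.groupby('NAME'):
    picks = rng.choice(grp['POSITION'].to_numpy(), size=n_samples,
                       p=grp['PROB'].to_numpy())
    for s, pick in enumerate(picks):
        for pos in grp['POSITION']:
            rows.append([name, pos, int(pos == pick), s])
df_samples = pd.DataFrame(rows, columns=['NAME', 'POSITION', 'PROB', 'SAMPLE'])

# Law-of-large-numbers check: the true p should fall in a +/- 4 sd band.
z = 4
p_est = df_samples.groupby(['NAME', 'POSITION'])['PROB'].mean()
p_true = df.set_index(['NAME', 'POSITION'])['PROB']
sd = np.sqrt(p_est * (1 - p_est) / n_samples)
assert ((p_true > p_est - z * sd) & (p_true < p_est + z * sd)).all()
```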

What is the most efficient way to do this in Pandas? I feel like I want to apply some sample function to the df.groupby('NAME') object.

P.S.

To be even more explicit, here is a rather long-winded way of doing this using NumPy.

n_samples = 3
df_list = []
for name in ['X', 'Y', 'Z']:
    idx = df['NAME'] == name
    position_samples = np.random.choice(df.loc[idx, 'POSITION'],
                                        n_samples,
                                        p=df.loc[idx, 'PROB'])
    # Indicator matrix: one row per POSITION, one column per sample.
    # (This indexing relies on POSITION being 0..k-1 within each group.)
    prob = np.zeros([idx.sum(), n_samples], dtype=int)
    prob[position_samples, np.arange(n_samples)] = 1
    position = np.tile(np.arange(idx.sum())[:, None], n_samples)
    sample = np.tile(np.arange(n_samples)[:, None], idx.sum()).T

    df_list.append(pd.DataFrame(
        [[name, position.ravel()[i], prob.ravel()[i],
          sample.ravel()[i]]
         for i in range(n_samples*idx.sum())],
        columns=['NAME', 'POSITION', 'PROB', 'SAMPLE']))

df_samples = pd.concat(df_list, ignore_index=True)


Answers (1)

Ami Tavory

If I understand correctly, you're looking for groupby + sample, plus a bit of indexing work.

First, sample by the probabilities:

n_samples = 3
df_samples = (df.groupby('NAME')
                .apply(lambda x: x[['NAME', 'POSITION']]
                       .sample(n_samples, replace=True, weights=x.PROB))
                .reset_index(drop=True))

Now add the extra columns:

df_samples['SAMPLE'] = df_samples.groupby('NAME').cumcount()
df_samples['PROB'] = 1


print(df_samples)

  NAME  POSITION  SAMPLE  PROB
0    X         1       0     1
1    X         0       1     1
2    X         1       2     1
3    Y         1       0     1
4    Y         1       1     1
5    Y         1       2     1
6    Z         2       0     1
7    Z         0       1     1
8    Z         0       2     1

Note that this doesn't include the zero-probability positions for each sample, as requested in the initial question, but it is a more concise way of storing the information.

If we also want to include the zero-probability positions, we can merge them in as follows:

domain = df[['NAME', 'POSITION']].drop_duplicates()
df_samples.drop('PROB', axis=1, inplace=True)
df_samples = pd.merge(df_samples, domain, on='NAME', 
                      suffixes=['_sample', ''])
df_samples['PROB'] = (df_samples['POSITION'] ==
                     df_samples['POSITION_sample']).astype(int)
df_samples.drop('POSITION_sample', axis=1, inplace=True)
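As a quick sanity check on the merged frame (sketched here on hand-written toy data, since the real output is random): each (NAME, SAMPLE) group should contain exactly one row with PROB == 1.

```python
import pandas as pd

# Hand-written toy output in the merged format, for illustration only.
df_samples = pd.DataFrame(
    [['X', 0, 1, 0], ['X', 1, 0, 0],
     ['X', 0, 0, 1], ['X', 1, 1, 1]],
    columns=['NAME', 'POSITION', 'PROB', 'SAMPLE'])

# Every (NAME, SAMPLE) group selects exactly one POSITION.
counts = df_samples.groupby(['NAME', 'SAMPLE'])['PROB'].sum()
assert (counts == 1).all()
```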
