Reputation: 13672
Consider the following pandas DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame(
[
['X', 0, 0.5],
['X', 1, 0.5],
['Y', 0, 0.25],
['Y', 1, 0.3],
['Y', 2, 0.45],
['Z', 0, 0.6],
['Z', 1, 0.1],
['Z', 2, 0.3]
], columns=['NAME', 'POSITION', 'PROB'])
Notice that df defines a discrete probability distribution for each unique NAME value, i.e.
assert ((df.groupby('NAME')['PROB'].sum() - 1)**2 < 1e-10).all()
What I would like to do is sample from these probability distributions. We can think of POSITION as the values corresponding to the probabilities. So for X, the sample will be 0 with probability 0.5 and 1 with probability 0.5.
I would like to create a new dataframe with columns ['NAME', 'POSITION', 'PROB', 'SAMPLE'] representing these samples. Each unique SAMPLE value represents a new sample. The PROB column is now always 0 or 1, indicating whether the given row was selected in the given sample. For example, if I were to draw 3 samples, one possible outcome is:
df_samples = pd.DataFrame(
[
['X', 0, 1, 0],
['X', 1, 0, 0],
['X', 0, 0, 1],
['X', 1, 1, 1],
['X', 0, 1, 2],
['X', 1, 0, 2],
['Y', 0, 1, 0],
['Y', 1, 0, 0],
['Y', 2, 0, 0],
['Y', 0, 0, 1],
['Y', 1, 0, 1],
['Y', 2, 1, 1],
['Y', 0, 1, 2],
['Y', 1, 0, 2],
['Y', 2, 0, 2],
['Z', 0, 0, 0],
['Z', 1, 0, 0],
['Z', 2, 1, 0],
['Z', 0, 0, 1],
['Z', 1, 0, 1],
['Z', 2, 1, 1],
['Z', 0, 1, 2],
['Z', 1, 0, 2],
['Z', 2, 0, 2],
], columns=['NAME', 'POSITION', 'PROB', 'SAMPLE'])
Of course, due to the randomness involved, this is just one of many possible outcomes.
A unit test for the program would be that, as the number of samples increases, by the law of large numbers the fraction of samples in which each (NAME, POSITION) pair is selected should tend to the true probability. One could calculate a confidence region based on the total number of samples and then check that the true probability lies within it. For example, using a normal approximation to binomial outcomes (which requires the total number of samples, n_samples, to be large), a (-4 sd, 4 sd) region test would be:
z = 4
p_est = df_samples.groupby(['NAME', 'POSITION'])['PROB'].mean()
p_true = df.set_index(['NAME', 'POSITION'])['PROB']
CI_lower = p_est - z*np.sqrt(p_est*(1-p_est)/n_samples)
CI_upper = p_est + z*np.sqrt(p_est*(1-p_est)/n_samples)
assert (p_true < CI_upper).all()
assert (p_true > CI_lower).all()
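The check above can be wrapped in a small helper. This is only a minimal sketch, assuming df and df_samples have the column layout described in this question; the function name within_confidence_region is hypothetical. Comparing two Series gives an element-wise boolean Series, so the result is reduced with .all() before being used as a single truth value.

```python
import numpy as np
import pandas as pd


def within_confidence_region(df, df_samples, n_samples, z=4):
    """Return True if every true probability lies inside the z-sd region.

    Hypothetical helper: assumes both frames have NAME, POSITION and PROB
    columns, and that PROB in df_samples is a 0/1 selection indicator.
    """
    # Estimated probability: fraction of samples in which each pair was selected.
    p_est = df_samples.groupby(['NAME', 'POSITION'])['PROB'].mean()
    # True probability, aligned to the same (NAME, POSITION) index.
    p_true = df.set_index(['NAME', 'POSITION'])['PROB'].reindex(p_est.index)
    # Normal approximation to the binomial standard error.
    half_width = z * np.sqrt(p_est * (1 - p_est) / n_samples)
    inside = (p_true > p_est - half_width) & (p_true < p_est + half_width)
    return bool(inside.all())
```

Note that when an estimate is exactly 0 or 1 the region has zero width, so the check is only meaningful once n_samples is large.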
What is the most efficient way to do this in pandas? I feel like I want to apply some sample function to the df.groupby('NAME') object.
P.S. To be even more explicit, here is a very long-winded way of doing this using NumPy.
n_samples = 3
df_list = []
for name in ['X', 'Y', 'Z']:
    idx = df['NAME'] == name
    position_samples = np.random.choice(df.loc[idx, 'POSITION'],
                                        n_samples,
                                        p=df.loc[idx, 'PROB'])
    prob = np.zeros([idx.sum(), n_samples])
    prob[position_samples, np.arange(n_samples)] = 1
    position = np.tile(np.arange(idx.sum())[:, None], n_samples)
    sample = np.tile(np.arange(n_samples)[:, None], idx.sum()).T
    df_list.append(pd.DataFrame(
        [[name, prob.ravel()[i], position.ravel()[i],
          sample.ravel()[i]]
         for i in range(n_samples*idx.sum())],
        columns=['NAME', 'PROB', 'POSITION', 'SAMPLE']))
df_samples = pd.concat(df_list)
Upvotes: 1
Views: 780
Reputation: 76297
If I understand correctly, you're looking for groupby + sample, followed by some indexing.
First, sample by the probabilities:
n_samples = 3
df_samples = (df.groupby('NAME')
                .apply(lambda x: x[['NAME', 'POSITION']]
                       .sample(n_samples, replace=True, weights=x.PROB))
                .reset_index(drop=True))
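A possible alternative, assuming pandas 1.1 or later, where grouped objects have their own sample method. This is a sketch of a swapped-in technique rather than the apply-based code above: the weights are passed as a Series aligned with df and are normalised within each group. The random_state value is an arbitrary choice for reproducibility.

```python
import pandas as pd

df = pd.DataFrame(
    [['X', 0, 0.5], ['X', 1, 0.5],
     ['Y', 0, 0.25], ['Y', 1, 0.3], ['Y', 2, 0.45],
     ['Z', 0, 0.6], ['Z', 1, 0.1], ['Z', 2, 0.3]],
    columns=['NAME', 'POSITION', 'PROB'])

n_samples = 3

# Draw n_samples rows per NAME, with replacement, weighted by PROB.
# random_state=0 is only here to make the sketch reproducible.
df_samples = (df.groupby('NAME')
                .sample(n=n_samples, replace=True, weights=df['PROB'],
                        random_state=0)
                .loc[:, ['NAME', 'POSITION']]
                .reset_index(drop=True))
```

The result has the same NAME and POSITION columns as before, so the steps that follow apply unchanged.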
Now add the extra columns:
df_samples['SAMPLE'] = df_samples.groupby('NAME').cumcount()
df_samples['PROB'] = 1
print(df_samples)
  NAME  POSITION  SAMPLE  PROB
0    X         1       0     1
1    X         0       1     1
2    X         1       2     1
3    Y         1       0     1
4    Y         1       1     1
5    Y         1       2     1
6    Z         2       0     1
7    Z         0       1     1
8    Z         0       2     1
Note that this doesn't include the 0-probability positions for each sample, as requested in the initial question, but it is a more concise way of storing the information.
If we also want to include the 0-probability positions, we can merge in the other positions as follows:
domain = df[['NAME', 'POSITION']].drop_duplicates()
df_samples.drop('PROB', axis=1, inplace=True)
df_samples = pd.merge(df_samples, domain, on='NAME',
                      suffixes=['_sample', ''])
df_samples['PROB'] = (df_samples['POSITION'] ==
                      df_samples['POSITION_sample']).astype(int)
df_samples.drop('POSITION_sample', axis=1, inplace=True)
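For reference, the whole procedure can be collected into one function. This is a rough sketch under stated assumptions, not the code above verbatim: it draws positions with NumPy's Generator.choice instead of DataFrame.sample, and the function name sample_distributions and its seed parameter are hypothetical.

```python
import numpy as np
import pandas as pd


def sample_distributions(df, n_samples, seed=None):
    """Draw n_samples from each NAME's distribution, one row per
    (NAME, POSITION, SAMPLE) with PROB set to 1 for the drawn position."""
    rng = np.random.default_rng(seed)
    parts = []
    for name, group in df.groupby('NAME'):
        p = group['PROB'].to_numpy(dtype=float)
        # Sampled POSITION values for this NAME, one per sample.
        drawn = rng.choice(group['POSITION'].to_numpy(), size=n_samples,
                           p=p / p.sum())
        parts.append(pd.DataFrame({'NAME': name,
                                   'POSITION_sample': drawn,
                                   'SAMPLE': np.arange(n_samples)}))
    drawn_df = pd.concat(parts, ignore_index=True)
    # Pair every sample with every POSITION of the same NAME.
    domain = df[['NAME', 'POSITION']].drop_duplicates()
    out = drawn_df.merge(domain, on='NAME')
    # 1 where the row's POSITION is the one drawn in that sample, else 0.
    out['PROB'] = (out['POSITION'] == out['POSITION_sample']).astype(int)
    return out[['NAME', 'POSITION', 'PROB', 'SAMPLE']]
```

Calling sample_distributions(df, n_samples=3, seed=0) returns a frame with the four columns asked for in the question, in which exactly one row per (NAME, SAMPLE) pair has PROB equal to 1.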
Upvotes: 2