Reputation: 351
Code:
import pandas as pd
df = pd.DataFrame({'data': list(range(100))})
I want to take a random sample of size 20 such that 80% of the elements are between 0 and 10 and 20% are between 50 and 70.
I want a method that works for an arbitrary number of conditions.
My idea, which works but is not clean: select everything between 0 and 10 and take 80% * 20 random rows, do the same for the other ranges, and concatenate. Is there a pandas built-in I can use? This approach does not scale well to a larger number of conditions.
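For reference, the manual approach described above might look roughly like the sketch below. The `conditions` list, the `manual_sample` name, and the fixed `random_state` are illustrative assumptions, not part of the original code; sampling falls back to `replace=True` only when a range holds fewer rows than requested (the 0 to 10 range has 11 rows, but 16 are needed).

```python
import pandas as pd

df = pd.DataFrame({'data': list(range(100))})

# Hypothetical spec: (boolean mask, fraction of the final sample).
# Series.between is inclusive on both ends by default.
conditions = [
    (df['data'].between(0, 10), 0.8),
    (df['data'].between(50, 70), 0.2),
]
size = 20

parts = []
for mask, frac in conditions:
    subset = df[mask]
    n = round(frac * size)
    # Sample with replacement only if the subset is too small to give n unique rows.
    parts.append(subset.sample(n=n, replace=n > len(subset), random_state=0))

manual_sample = pd.concat(parts)
```

Each extra condition needs its own mask, which is the part that gets unwieldy as the number of conditions grows.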
Upvotes: 1
Views: 1054
Reputation: 260500
You can use bins and a dictionary of the proportions to sample.
# ┌─0─┐┌─1─┐┌─2─┐┌─3───┐
bins = [-1, 10, 50, 70, float('inf')]
fraction = {0: 0.8, 2: 0.2}  # group 0 is (-1, 10], group 2 is (50, 70] (pd.cut bins are right-closed by default)
size = 20
groups = pd.cut(df['data'], bins=bins, labels=range(len(bins)-1))
sampled = (df
   .groupby(groups)['data']
   .apply(lambda g: g.sample(n=int(fraction.get(g.name, 0)*size),
                             replace=True)
         )
   #.droplevel(0)
)
NB. I used replace=True in sample here because it would otherwise be impossible to draw 16 unique elements from the 0-10 group (it has only 11 values); you can change that on your real data if the condition is safe. Also, add .droplevel(0) to remove the group id from the index.
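As a rough sketch of the two adjustments mentioned in the note, the variant below samples with replacement only when a group is too small and applies `.droplevel(0)` so the result keeps just the original row index. The helper name `sample_group`, the use of `round` instead of `int`, and `observed=True` (passed to avoid a deprecation warning on recent pandas versions) are assumptions added for illustration.

```python
import pandas as pd

df = pd.DataFrame({'data': list(range(100))})
bins = [-1, 10, 50, 70, float('inf')]
fraction = {0: 0.8, 2: 0.2}
size = 20

groups = pd.cut(df['data'], bins=bins, labels=range(len(bins) - 1))

def sample_group(g):
    # g.name is the bin label; bins missing from `fraction` get 0 rows.
    n = round(fraction.get(g.name, 0) * size)
    # Fall back to replacement only when the group has fewer than n rows.
    return g.sample(n=n, replace=n > len(g))

flat = (df
    .groupby(groups, observed=True)['data']
    .apply(sample_group)
    .droplevel(0)  # drop the group id, keep the original row index
)

# Quick sanity check of the proportions.
n_low = flat.between(0, 10).sum()
n_high = flat.between(51, 70).sum()
```

With the example data, group 0 still needs replacement (11 rows for 16 draws), while group 2 is sampled without it.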
output:
data
0     6      6
      8      8
      2      2
      3      3
      0      0
      0      0
      6      6
      0      0
      6      6
      3      3
      10    10
      3      3
      8      8
      8      8
      8      8
      2      2
2     54    54
      53    53
      62    62
      64    64
Name: data, dtype: int64
Upvotes: 1