Reputation: 523
I want to make a random sample selection in python from the following df such that at least 65% of the resulting sample should have color yellow and cumulative sum of the quantities selected to be less than or equals to 18.
Original Dataset:
Date Id color qty
02-03-2018 A red 5
03-03-2018 B blue 2
03-03-2018 C green 3
04-03-2018 D yellow 4
04-03-2018 E yellow 7
04-03-2018 G yellow 6
04-03-2018 H orange 8
05-03-2018 I yellow 1
06-03-2018 J yellow 5
I have got total qty. selected condition covered but stuck on how to move forward with integrating the % condition:
df2 = df1.sample(n=df1.shape[0])
df3= df2[df2.qty.cumsum() <= 18]
Required dataset:
Date Id color qty
03-03-2018 B blue 2
04-03-2018 D yellow 4
04-03-2018 G yellow 6
06-03-2018 J yellow 5
Or something like this:
Date Id color qty
02-03-2018 A red 5
04-03-2018 D yellow 4
04-03-2018 E yellow 7
05-03-2018 I yellow 1
Any help would be really appreciated!
Thanks in advance.
Upvotes: 4
Views: 5933
Reputation: 1614
Filter rows with 'yellow'
and select a random sample of at least 65% of your total sample size
import random
yellow_size = float(random.randint(65,100)) / 100
df_yellow = df3[df3['color'] == 'yellow'].sample(yellow_size*sample_size)
Filter rows with other colors and select a random sample for the remaining of your sample size.
others_size = 1 - yellow_size
df_others = df3[df3['color'] != 'yellow].sample(others_size*sample_size)
Combine them both and shuffle the rows.
df_sample = pd.concat([df_yellow, df_others]).sample(frac=1)
UPDATE:
If you want to check for both conditions simultaneously, this could be one way to do it:
import random
df_sample = df
while sum(df_sample['qty']) > 18:
yellow_size = float(random.randint(65,100)) / 100
df_yellow = df[df['color'] == 'yellow'].sample(yellow_size*sample_size)
others_size = 1 - yellow_size
df_others = df[df['color'] != 'yellow'].sample(others_size*sample_size)
df_sample = pd.concat([df_yellow, df_others]).sample(frac=1)
Upvotes: 3
Reputation: 67
I think this example help you. I add columns df2['yellow_rate'] and calculate rate. You only check df2.iloc[df2.shape[0] - 1]['yellow_rate'] value.
df1=pd.DataFrame({'id':['A','B','C','D','E','G','H','I','J'],'color':['red','bule','green','yellow','yellow','yellow','orange','yellow','yellow'], 'qty':[5,2, 3, 4, 7, 6, 8, 1, 5]})
df2 = df1.sample(n=df1.shape[0])
df2['yellow_rate'] = df2[df2.qty.cumsum() <= 18]['color'].apply( lambda x : 1 if x =='yellow' else 0)
df2 = df2.dropna().append(df2.sum(numeric_only=True)/ df2.count(numeric_only=True), ignore_index=True)
Upvotes: 0
Reputation: 1508
I would use this package to over sample your yellows into a new sample that has the balance you want:
https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html
From there just randomly select items and check sum until you have the set you want.
Something less time complex would be binary searching a range the length of your data frame, and using the binary search term as your sample size, until you get the cumsum you want. The assumes the feature is symmetrically distributed.
Upvotes: 0