Analytics_TM
Analytics_TM

Reputation: 523

Python random sample selection based on multiple conditions

I want to make a random sample selection in python from the following df such that at least 65% of the resulting sample should have color yellow and cumulative sum of the quantities selected to be less than or equals to 18.

Original Dataset:

Date        Id      color       qty
02-03-2018  A       red         5
03-03-2018  B       blue        2
03-03-2018  C       green       3
04-03-2018  D       yellow      4
04-03-2018  E       yellow      7
04-03-2018  G       yellow      6
04-03-2018  H       orange      8
05-03-2018  I       yellow      1
06-03-2018  J       yellow      5

I have got total qty. selected condition covered but stuck on how to move forward with integrating the % condition:

df2 = df1.sample(n=df1.shape[0])

df3= df2[df2.qty.cumsum() <= 18]

Required dataset:

Date        Id      color       qty
03-03-2018  B       blue        2
04-03-2018  D       yellow      4
04-03-2018  G       yellow      6
06-03-2018  J       yellow      5

Or something like this:

Date        Id      color       qty
02-03-2018  A       red         5
04-03-2018  D       yellow      4
04-03-2018  E       yellow      7
05-03-2018  I       yellow      1

Any help would be really appreciated!

Thanks in advance.

Upvotes: 4

Views: 5933

Answers (3)

panktijk
panktijk

Reputation: 1614

  1. Filter rows with 'yellow' and select a random sample of at least 65% of your total sample size

    import random
    yellow_size = float(random.randint(65,100)) / 100
    df_yellow = df3[df3['color'] == 'yellow'].sample(yellow_size*sample_size)
    
  2. Filter rows with other colors and select a random sample for the remaining of your sample size.

    others_size = 1 - yellow_size
    df_others = df3[df3['color'] != 'yellow].sample(others_size*sample_size)
    
  3. Combine them both and shuffle the rows.

    df_sample = pd.concat([df_yellow, df_others]).sample(frac=1)
    

UPDATE:

If you want to check for both conditions simultaneously, this could be one way to do it:

import random

df_sample = df

while sum(df_sample['qty']) > 18:
    yellow_size = float(random.randint(65,100)) / 100
    df_yellow = df[df['color'] == 'yellow'].sample(yellow_size*sample_size)
    others_size = 1 - yellow_size
    df_others = df[df['color'] != 'yellow'].sample(others_size*sample_size)
    df_sample = pd.concat([df_yellow, df_others]).sample(frac=1)

Upvotes: 3

mtgarden
mtgarden

Reputation: 67

I think this example help you. I add columns df2['yellow_rate'] and calculate rate. You only check df2.iloc[df2.shape[0] - 1]['yellow_rate'] value.

df1=pd.DataFrame({'id':['A','B','C','D','E','G','H','I','J'],'color':['red','bule','green','yellow','yellow','yellow','orange','yellow','yellow'], 'qty':[5,2, 3, 4, 7, 6, 8, 1, 5]})
df2 = df1.sample(n=df1.shape[0])
df2['yellow_rate'] =  df2[df2.qty.cumsum() <= 18]['color'].apply( lambda x : 1 if x =='yellow' else 0)
df2 = df2.dropna().append(df2.sum(numeric_only=True)/ df2.count(numeric_only=True), ignore_index=True)

Upvotes: 0

John R
John R

Reputation: 1508

I would use this package to over sample your yellows into a new sample that has the balance you want:

https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html

From there just randomly select items and check sum until you have the set you want.

Something less time complex would be binary searching a range the length of your data frame, and using the binary search term as your sample size, until you get the cumsum you want. The assumes the feature is symmetrically distributed.

Upvotes: 0

Related Questions