Reputation:

Sampling in python with multiple conditions and percentages

Person ID	Condition 1	Condition 2	Condition 3
A	Yes	No	Yes
B	No	Yes	No
C	Yes	No	No

Hi! I have to generate a sample from a fairly large dataset, and the inclusion criteria are a little more complicated than those I have previously handled. For this sample, I need 100 people. 25 should have condition 1, 25 should have condition 2, and 25 should have condition 3. The last 25 should have none of the conditions (this is an easy one, no problem here).

The conditions don't have to be mutually exclusive - so a person can have condition 1 and condition 2, but this would count as a case for each of these conditions. Because of the nature of the dataset, it is very likely that most people will have a combination of conditions (and this is actually preferable to a sample in which the 3 conditions are entirely distinct from one another). Does this make sense? It's not conceptually difficult, I just haven't been able to figure out the solution!

Usually I would use a line like this to filter the variables I need for a sample:

sample = df[(df['condition 1'] == "yes") & (df["condition 2"] != 0) & (df["condition 3"] != 0)]

But in this case the sampling scheme is more complicated, and using this approach would take a lot of trial and error

Upvotes: 0

Answers (1)

user19077881

Reputation: 5470

Assuming: that you want random sampling; that no rows are chosen more than once; that you want a specific number of each condition but more than one may occur in a row (so less than 100 rows total) then you could use the following approach as demonstrated using a simplified example. It is necessary to keep track of all conditions as samples are collected and also to handle the case that there are not enough samples meeting the condition.

import pandas as pd

num = 2   # number of rows in each of the 4 samples

# create DF with random 0/1 values for each Cond column
df= pd.DataFrame({'id': ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'],
                   'Cond1': [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0],
                  'Cond2': [0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1],
                  'Cond3': [0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1]
                  })

# take sample after checking maximum available to meet condition
x = min(num,df[df['Cond1'].eq(1)].shape[0])
sample1 = df[df['Cond1'].eq(1)].sample(x)

#drop all rows meeting Cond1 as now have required samples or max available for Cond1
rest = df.drop(df[df['Cond1'].eq(1)].index)

#count numnber of samples already have for remaining Conds
s1c2 = sample1[sample1['Cond2'].eq(1)].shape[0]
s1c3 = sample1[sample1['Cond3'].eq(1)].shape[0]

# obtain required remaining samples for Cond2 or as many available
# also not exceeded required samples for Cond3 (or loop again)
x = min(num-s1c2,df[df['Cond2'].eq(1)].shape[0])
while True:
    sample2 = rest[rest['Cond2'].eq(1)].sample(x)
    c3 = s1c3 + sample2[sample2['Cond3'].eq(1)].shape[0]
    if c3 > num:
        continue
    break

#drop all rows meeting Cond2 as now have required samples for Cond2
rest = rest.drop(rest[rest['Cond2'].eq(1)].index)

#obtain any remaining samples required for Cond3 or as many as available
x = min(num-c3,df[df['Cond3'].eq(1)].shape[0])
sample3 = rest[rest['Cond3'].eq(1)].sample(x)

#obtain last sample or as many as available
x = min(num,df[df[['Cond1', 'Cond2', 'Cond3']].eq(0).all(1)].shape[0])
sample4 = df[df[['Cond1', 'Cond2', 'Cond3']].eq(0).all(1)].sample(x)

# combine all samples to give DF with num x 4 rows or as many as meet the conditions
final = pd.concat([sample1,sample2, sample3, sample4])
#add .reset_index(drop=True) if sequential index required

print(final)

which gives from 3 code runs:

   id  Cond1  Cond2  Cond3
19  t      1      0      0
20  u      1      1      0
2   c      0      1      1
18  s      0      0      1
16  q      0      0      0
15  p      0      0      0

   id  Cond1  Cond2  Cond3
10  k      1      0      1
24  y      1      1      0
25  z      0      1      1
15  p      0      0      0
1   b      0      0      0

   id  Cond1  Cond2  Cond3
12  m      1      0      0
19  t      1      0      0
9   j      0      1      0
6   g      0      1      0
22  w      0      0      1
5   f      0      0      1
1   b      0      0      0
15  p      0      0      0

Upvotes: 0

Sampling in python with multiple conditions and percentages

Answers (1)

Related Questions