Reputation:
Person ID | Condition 1 | Condition 2 | Condition 3 |
---|---|---|---|
A | Yes | No | Yes |
B | No | Yes | No |
C | Yes | No | No |
Hi! I have to generate a sample from a fairly large dataset, and the inclusion criteria are a little more complicated than those I have previously handled. For this sample, I need 100 people. 25 should have condition 1, 25 should have condition 2, and 25 should have condition 3. The last 25 should have none of the conditions (this is an easy one, no problem here).
The conditions don't have to be mutually exclusive - so a person can have condition 1 and condition 2, but this would count as a case for each of these conditions. Because of the nature of the dataset, it is very likely that most people will have a combination of conditions (and this is actually preferable to a sample in which the 3 conditions are entirely distinct from one another). Does this make sense? It's not conceptually difficult, I just haven't been able to figure out the solution!
Usually I would use a line like this to filter the variables I need for a sample:
sample = df[(df['condition 1'] == "yes") & (df["condition 2"] != 0) & (df["condition 3"] != 0)]
But in this case the sampling scheme is more complicated, and using this approach would take a lot of trial and error
Upvotes: 0
Views: 80
Reputation: 5470
Assuming: that you want random sampling; that no rows are chosen more than once; that you want a specific number of each condition but more than one may occur in a row (so less than 100 rows total) then you could use the following approach as demonstrated using a simplified example. It is necessary to keep track of all conditions as samples are collected and also to handle the case that there are not enough samples meeting the condition.
import pandas as pd
num = 2 # number of rows in each of the 4 samples
# create DF with random 0/1 values for each Cond column
df= pd.DataFrame({'id': ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'],
'Cond1': [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0],
'Cond2': [0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1],
'Cond3': [0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1]
})
# take sample after checking maximum available to meet condition
x = min(num,df[df['Cond1'].eq(1)].shape[0])
sample1 = df[df['Cond1'].eq(1)].sample(x)
#drop all rows meeting Cond1 as now have required samples or max available for Cond1
rest = df.drop(df[df['Cond1'].eq(1)].index)
#count numnber of samples already have for remaining Conds
s1c2 = sample1[sample1['Cond2'].eq(1)].shape[0]
s1c3 = sample1[sample1['Cond3'].eq(1)].shape[0]
# obtain required remaining samples for Cond2 or as many available
# also not exceeded required samples for Cond3 (or loop again)
x = min(num-s1c2,df[df['Cond2'].eq(1)].shape[0])
while True:
sample2 = rest[rest['Cond2'].eq(1)].sample(x)
c3 = s1c3 + sample2[sample2['Cond3'].eq(1)].shape[0]
if c3 > num:
continue
break
#drop all rows meeting Cond2 as now have required samples for Cond2
rest = rest.drop(rest[rest['Cond2'].eq(1)].index)
#obtain any remaining samples required for Cond3 or as many as available
x = min(num-c3,df[df['Cond3'].eq(1)].shape[0])
sample3 = rest[rest['Cond3'].eq(1)].sample(x)
#obtain last sample or as many as available
x = min(num,df[df[['Cond1', 'Cond2', 'Cond3']].eq(0).all(1)].shape[0])
sample4 = df[df[['Cond1', 'Cond2', 'Cond3']].eq(0).all(1)].sample(x)
# combine all samples to give DF with num x 4 rows or as many as meet the conditions
final = pd.concat([sample1,sample2, sample3, sample4])
#add .reset_index(drop=True) if sequential index required
print(final)
which gives from 3 code runs:
id Cond1 Cond2 Cond3
19 t 1 0 0
20 u 1 1 0
2 c 0 1 1
18 s 0 0 1
16 q 0 0 0
15 p 0 0 0
id Cond1 Cond2 Cond3
10 k 1 0 1
24 y 1 1 0
25 z 0 1 1
15 p 0 0 0
1 b 0 0 0
id Cond1 Cond2 Cond3
12 m 1 0 0
19 t 1 0 0
9 j 0 1 0
6 g 0 1 0
22 w 0 0 1
5 f 0 0 1
1 b 0 0 0
15 p 0 0 0
Upvotes: 0