Reputation: 43
Have a df with values :
name algo accuracy
tom 1 88
tommy 2 87
mark 1 88
stuart 3 100
alex 2 99
lincoln 1 88
How to randomly pick 4 records from df with a condition that at least one record should be picked from each unique algo column values. here, algo column has only 3 unique values (1 , 2 , 3 )
Sample outputs:
name algo accuracy
tom 1 88
tommy 2 87
stuart 3 100
lincoln 1 88
sample output2:
name algo accuracy
mark 1 88
stuart 3 100
alex 2 99
lincoln 1 88
Upvotes: 1
Views: 284
Reputation: 150725
One way
num_sample, num_algo = 4, 3
# sample one for each algo
out = df.groupby('algo').sample(n=num_sample//num_algo)
# append one more sample from those that didn't get selected.
out = out.append(df.drop(out.index).sample(n=num_sample-num_algo) )
Another way is to shuffle the whole data, enumerate the rows within each algo, sort by that enumeration and take the required number of samples. This is slightly more code than the first approach, but is cheaper and produces more balanced algo counts:
# shuffle data
df_random = df['algo'].sample(frac=1)
# enumerations of rows with the same algo
enums = df_random.groupby(df_random).cumcount()
# sort with `np.argsort`:
enums = enums.sort_values()
# pick the first num_sample indices
# these will be indices of the samples
# so we can use `loc`
out = df.loc[enums.iloc[:num_sample].index]
Upvotes: 3