Reputation: 406
I have a dataset on which I want to do sampling after a groupby. In general this can be achieved with df.groupby("some_id").sample(n=100). But the problem is that some groups have fewer than 100 rows (and yes, replace=True is an option, but what if we want to keep the smaller sample; I mean, if a group has more than 100 rows I want a sample of size 100, and if it has fewer, leave it as it is). I couldn't find a single example of achieving something similar, and any ideas are appreciated.
For now the only idea I have is to forget about groupby and create, let's say, a list of groups, something like:
def weird_sampling(df):
    if df.shape[0] > 99:
        return df.sample(100)
    return df

groups_list = []
for i in df.some_id.unique():
    groups_list.append(weird_sampling(df[df.some_id == i]))
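(Presumably the pieces would then still have to be stitched back into a single frame, for example with pd.concat(groups_list).)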
But it seems extremely inefficient.
Upvotes: 4
Views: 2070
Reputation: 63
Very similar to the above answer by Igor, but I guess less hardcoded, and using random_state and reset_index. I would advise using random_state if you want to make your results reproducible.
def sample(df, sample_size, seed):
    return df.sample(n=min(len(df), sample_size), random_state=seed)

df_sample = df.groupby(['some_id']).apply(sample, sample_size=100, seed=67871215).reset_index(drop=True)
df_sample
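For a quick sanity check (the toy frame and column values below are made up for illustration, not taken from the question), the same helper on a small DataFrame behaves as expected: the large group is cut down to the requested size and the small group is returned whole.

import pandas as pd

# hypothetical toy data: group "a" has 5 rows, group "b" only 2
toy = pd.DataFrame({"some_id": ["a"] * 5 + ["b"] * 2, "value": range(7)})

def sample(df, sample_size, seed):
    return df.sample(n=min(len(df), sample_size), random_state=seed)

# ask for 3 rows per group: "a" is downsampled to 3, "b" keeps both rows
toy_sample = toy.groupby(["some_id"]).apply(sample, sample_size=3, seed=67871215).reset_index(drop=True)
print(toy_sample["some_id"].value_counts())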
Upvotes: 0
Reputation: 1284
I think the cleanest answer might be to shuffle your data and then select up to n of each group:
# maximum number of elements in group
n = 100
# sample(frac=1) --> randomise the order
# groupby("some_id").head(n) --> select up to n
df.sample(frac=1).groupby("some_id").head(n)
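Not part of the original answer, but if reproducibility matters here as well, sample(frac=1) also accepts random_state, so the shuffle (and therefore which rows are kept) can be pinned down:

# same idea, with a fixed seed so the result is reproducible
df.sample(frac=1, random_state=42).groupby("some_id").head(n)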
Upvotes: 8
Reputation: 406
After some more trials with this problem I came up with this idea, which still may not be the best or most efficient solution, but is already much better and does the job:
df = df.groupby("some_id").apply(lambda x: x.sample(n = 100) if (x.shape[0]>99) else x)
Upvotes: 2