Removing random rows from a data frame until the count equals some criterion

Question

I have a DataFrame that I feed to an ML library in Python. The data is divided into 5 tasks: t1, t2, t3, t4, t5. The number of rows per task is currently uneven. Here is a simplified example:

task, someValue
t1,   XXX
t1,   XXX
t1,   XXX
t1,   XXX
t2,   XXX
t2,   XXX

In the case above, I want to remove random rows labeled "t1" until there are as many "t1" rows as "t2" rows. After the code runs, the data should look like this:

task, someValue
t1,   XXX
t1,   XXX
t2,   XXX
t2,   XXX

What is the cleanest way to do this? I could of course use for loops, if conditions, and random numbers, counting the occurrences on each iteration, but that would not be very elegant. Surely there must be a way using DataFrame functions? So far, this is what I have:

def equalize_rows(df):
    t = df['task'].value_counts()
    minimum_occurrence = min(t)

cs95 · Accepted Answer

You can calculate the smallest number of rows for any task in your DataFrame, and then use groupby + head to get the first N rows per task.

v = df['task'].value_counts().min()
df = df.groupby('task', as_index=False).head(v)

df
  task someValue
0   t1       XXX
1   t1       XXX
4   t2       XXX
5   t2       XXX
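
Note that head keeps the first N rows of each group, so the rows removed are not random. If random removal matters, one possible variation is to shuffle the frame before grouping. The sketch below is only an illustration under that assumption: the parameter names label_col and seed are hypothetical, and the sample values are made up to stand in for XXX.

```python
import pandas as pd

def equalize_rows(df, label_col="task", seed=None):
    """Randomly downsample every group to the size of the smallest group."""
    # Size of the smallest group
    n = df[label_col].value_counts().min()
    # Shuffle all rows, then keep the first n rows of each group
    shuffled = df.sample(frac=1, random_state=seed)
    balanced = shuffled.groupby(label_col).head(n)
    # Restore the original row order
    return balanced.sort_index()

# Made-up data standing in for the example in the question
df = pd.DataFrame({
    "task": ["t1", "t1", "t1", "t1", "t2", "t2"],
    "someValue": ["a", "b", "c", "d", "e", "f"],
})
balanced = equalize_rows(df, seed=0)
```

Passing a fixed seed makes the selection reproducible. In pandas 1.1 and later, df.groupby('task').sample(n=v) should give a similar result in a single call.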
