How to only keep rows where a value in a column appears often enough

Question

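# Number of rows for each distinct entry in 'value'
a = df.groupby('value').size()
newFrame = pd.DataFrame()

for el in a.keys():
    if a[el] > 300000:
        # Re-scans and re-copies the frame for every frequent value
        newFrame = pd.concat([newFrame, df[df.value == el]])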

I have written this code, which does what I want but is really slow. I only want to keep the rows whose 'value' entry occurs in more than 300000 rows. If a value occurs less often than that, I want to drop its rows.

jezrael · Accepted Answer

Use GroupBy.transform with 'size' to build a Series of the same length as the original DataFrame, where each row holds the count of its group (the counts GroupBy.size would return), then filter with boolean indexing:

df = df[df.groupby('value')['value'].transform('size') > 300000]

If you will modify the filtered DataFrame later, add .copy() so that later assignments do not raise a SettingWithCopyWarning:

df = df[df.groupby('value')['value'].transform('size') > 300000].copy()
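To see what the intermediate Series looks like, here is a minimal sketch on toy data. The sample values, the extra column x, and the small threshold of 2 (standing in for 300000) are made up for illustration; only the 'value' column name comes from the question.

```python
import pandas as pd

# Toy data: 'a' occurs 3 times, 'b' twice, 'c' once (illustrative only).
df = pd.DataFrame({"value": ["a", "b", "a", "c", "a", "b"], "x": range(6)})

threshold = 2  # hypothetical stand-in for 300000

# One entry per row: the size of the group that row belongs to.
counts = df.groupby("value")["value"].transform("size")

# Boolean indexing keeps rows whose value occurs more than `threshold` times.
filtered = df[counts > threshold].copy()

# Because of .copy(), adding a column works on an independent DataFrame.
filtered["x2"] = filtered["x"] * 2

print(counts.tolist())
print(filtered)
```

Since counts is aligned with df row by row, the comparison yields a mask that can index df directly, with no Python-level loop or repeated pd.concat.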
