CptSnuggles

Reputation: 137

Fastest alternative to pd.groupby.filter when filtering on group size

I have a huge DataFrame with millions of rows and about 20-30 columns containing various types of data, e.g. strings, numbers, dates, etc.

df

index           t1      num1    float1  ...          str2  
0       2014-10-21      3456     0.000  ...  ayzkcxtoScUy             
1       2014-10-21      2453     0.000  ...  jZygJWtxyVnS             
...            ...       ...       ...  ...           ...           
n-1     2020-11-06    708735   670.818  ...  UWVhmKCfmzVj             
n       2020-11-06     70630   670.817  ...  EvhreYZotqVS             

Let's just say it's pretty wild, but I need each and every row with all of its values. I now want to group on certain columns and eliminate groups (and their rows) from the original DataFrame df based on the group size. In particular, I want to eliminate all groups of size 1.

FIRST NAIVE APPROACH

I searched and tried using this answer: How to select rows in Pandas dataframe where value appears more than once

lst = ["t1", "str1", "num1", "str2", "num2"]
df = df.groupby(lst).filter(lambda x: len(x.index) > 1).reset_index(drop=True)

which indeed works as expected. My DataFrame df is now stripped of all rows that occurred in groups of size 1. The problem is how the filter method scales with the dimensions of my DataFrame: it simply takes too long. To put this into perspective, grouping on these sample columns yields about 165,000 groups on a 2.5-million-row DataFrame, and roughly a third of these groups have a size of 1. I had to abort the execution of this script because it would have taken ages. I also tried to take inspiration from How do I improve the performance of pandas GroupBy filter operation?, but couldn't get it working with map because I'm grouping on a DataFrame and not on a Series. With the transform method, performance got even worse.
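For completeness, this is roughly how I imagine the map idea could be adapted to several columns by building one tuple key per row (just a sketch, not benchmarked on the full data; key and sizes are names I made up, df and lst are as above):

# build one hashable key (a tuple of the lst values) per row
key = pd.Series(list(df[lst].itertuples(index=False, name=None)), index=df.index)
# look up every row's group size by mapping its key onto the value counts
sizes = key.map(key.value_counts())
df = df[sizes > 1].reset_index(drop=True)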

SIDENOTE

Investigating further, I found out that there has been an issue with using filter on a DataFrame that has datetime64[ns, UTC] and/or datetime64[ns] columns. I used del df[x] to delete all three of these columns, which improved the performance of the filter method by about a third. Still not enough, but significant enough to mention here, especially since I actually need these columns and can't just delete them.
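If it helps to reproduce this, the workaround looks roughly like the sketch below (illustrative only; it can obviously only help for datetime columns that are not needed as grouping keys, and dt_cols/filtered are just made-up names):

# datetime columns that are not part of the grouping columns
dt_cols = df.select_dtypes(include=["datetime", "datetimetz"]).columns.difference(lst)
# filter without the datetime columns, then re-attach them via the index
filtered = df.drop(columns=dt_cols).groupby(lst).filter(lambda g: len(g) > 1)
filtered = filtered.join(df[dt_cols])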

SECOND 'CLEVER' APPROACH

I then tried to use clever indexing of my data to avoid groupby, filter and transform altogether, using .value_counts() from link 1.

vc = df[lst].value_counts()
vc_index = vc.index[vc.gt(1)]   # keep only combinations occurring more than once
df = df[df[lst].isin(vc_index)]

I'm getting the value counts vc and then building a MultiIndex vc_index that contains only the wanted index entries (those with a count > 1). After that, I tried filtering my df with .isin() like in link 1, but this set all values of df to NaN/NaT. I'm stuck here - I'm not sure what I did wrong.

df

index      t1      num1    float1  ...    str2  
0         NaT       NaN       NaN  ...     NaN             
1         NaT       NaN       NaN  ...     NaN             
...       ...       ...       ...  ...     ...           
n-1       NaT       NaN       NaN  ...     NaN             
n         NaT       NaN       NaN  ...     NaN             
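In pseudo-terms, what I was trying to achieve is a row-wise membership test, something like the sketch below (I haven't verified it on the real data; DataFrame.value_counts needs pandas >= 1.1, and row_key/keep are made-up names):

vc = df[lst].value_counts()
keep = vc.index[vc.gt(1)]                    # combinations occurring more than once
row_key = pd.MultiIndex.from_frame(df[lst])  # one MultiIndex entry per row of df
df = df[row_key.isin(keep)]                  # compare whole rows instead of single cells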

In another attempt, I tried using the pd.Index.difference() method

vc = df[lst].value_counts()
df = df.set_index(keys=lst)
df.index = df.index.difference(other=vc.index[vc.gt(1)])

but this only gave me a TypeError: '<' not supported between instances of 'float' and 'str'.

Honestly, I'm kind of clueless about what's best here; I've been sitting on this problem for two days. I even thought about parallelization (maybe with Dask?), but I'm unsure how to use that since I've never worked with it. Help would be much appreciated.

Upvotes: 0

Views: 1204

Answers (2)

CptSnuggles

Reputation: 137

The solution from @Quang Hoang works very well. Some benchmarks I made with my dataset:

(rs = rows of df, ngrps = df.groupby(lst).ngroups)

method   100k rs/82,488 ngrps  200k rs/164,466 ngrps  400k rs/331,351 ngrps  800k rs/672,905 ngrps  1,600k rs/1,351,525 ngrps

duplicated     0:00:00.031236         0:00:00.078112         0:00:00.181825         0:00:00.331095             0:00:00.683959
transform      0:00:00.062507         0:00:00.109386         0:00:00.261506         0:00:00.528166             0:00:01.029606
filter         0:00:09.039214         0:00:18.422355         0:00:37.372117         0:01:15.531945             0:02:32.075144

It scales pretty well using duplicated, but be aware: if there are NaN values in the columns you group on (the ones in lst in my example), duplicated will not drop those rows.
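A minimal toy example of that caveat (made-up data, just to show the difference in behaviour):

import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1, 1, np.nan, np.nan], "b": ["x", "x", "y", "y"]})

# duplicated treats the NaN keys as equal, so the (NaN, 'y') rows are kept ...
print(toy[toy.duplicated(["a", "b"], keep=False)])

# ... whereas groupby drops NaN keys by default, so filter removes those rows entirely
print(toy.groupby(["a", "b"]).filter(lambda g: len(g) > 1))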

Upvotes: 0

Quang Hoang

Reputation: 150785

For this specific use case (groups with count > 1), duplicated is much faster:

df[df.duplicated(lst, keep=False)]
# 231 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Another option, not quite as fast but still significantly faster than filter and more generally applicable, is groupby().transform('size'):

df[df.groupby(lst)['t1'].transform('size')>1]
# 554 ms ± 108 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
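Side note (not from the original answer): the 't1' column is only there to give transform a single column to work on; since 'size' counts rows regardless of their values, any of the grouping columns would do, e.g.:

df[df.groupby(lst)[lst[0]].transform('size') > 1]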

Compared to:

df.groupby(lst).filter(lambda x: len(x) > 1)
# CPU times: user 38.8 s, sys: 482 ms, total: 39.3 s
# Wall time: 39.4 s

Upvotes: 3
