CptSnuggles

Reputation: 137

Fastest alternative to pd.groupby.filter when filtering on group size

I have a huge DataFrame with millions of rows and about 20-30 columns containing various types of data, e.g. strings, numbers, dates, etc.

df

index           t1      num1    float1  ...          str2  
0       2014-10-21      3456     0.000  ...  ayzkcxtoScUy             
1       2014-10-21      2453     0.000  ...  jZygJWtxyVnS             
...            ...       ...       ...  ...           ...           
n-1     2020-11-06    708735   670.818  ...  UWVhmKCfmzVj             
n       2020-11-06     70630   670.817  ...  EvhreYZotqVS             

Let's just say it's pretty wild, but I need each and every row with all of its values. I now want to group on certain columns and eliminate groups (and their rows) from the original DataFrame df based on the group size. In particular, I want to eliminate all groups of size 1.

FIRST NAIVE APPROACH

I searched and tried using this answer: How to select rows in Pandas dataframe where value appears more than once

lst = ["t1", "str1", "num1", "str2", "num2"]
df = df.groupby(lst).filter(lambda x: len(x.index) > 1).reset_index(drop=True)

which indeed works as expected. My DataFrame df is now stripped of all rows that occurred in groups of size 1. The problem is how the filter method scales with the dimensions of my DataFrame: it simply takes too long. To put this into perspective, grouping on these sample columns yields about 165,000 groups on a 2.5-million-row DataFrame, and roughly a third of these groups have a size of 1. I had to abort the execution of this script because it would have taken ages. I also tried to take inspiration from How do I improve the performance of pandas GroupBy filter operation?, but couldn't get it working with map because I'm grouping on a DataFrame and not on a Series. With the transform method, performance got even worse.
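For completeness, this is roughly how I imagine the map idea could be adapted to several columns by building one tuple key per row (just a sketch, not benchmarked on the full data; key and sizes are names I made up, df and lst are as above):

# build one hashable key (a tuple of the lst values) per row
key = pd.Series(list(df[lst].itertuples(index=False, name=None)), index=df.index)
# look up every row's group size by mapping its key onto the value counts
sizes = key.map(key.value_counts())
df = df[sizes > 1].reset_index(drop=True)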

SIDENOTE

Investigating further, I found out that there has been an issue with using filter on a DataFrame that has datetime64[ns, UTC] and/or datetime64[ns] columns. I used del df[x] to delete all three of these columns, which improved the performance of the filter method by about a third. Still not enough, but significant enough to mention here, especially since I actually need these columns and can't just delete them.
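If it helps to reproduce this, the workaround looks roughly like the sketch below (illustrative only; it can obviously only help for datetime columns that are not needed as grouping keys, and dt_cols/filtered are just made-up names):

# datetime columns that are not part of the grouping columns
dt_cols = df.select_dtypes(include=["datetime", "datetimetz"]).columns.difference(lst)
# filter without the datetime columns, then re-attach them via the index
filtered = df.drop(columns=dt_cols).groupby(lst).filter(lambda g: len(g) > 1)
filtered = filtered.join(df[dt_cols])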

SECOND 'CLEVER' APPROACH

I then tried to use clever indexing of my data to avoid groupby, filter and transform altogether, using .value_counts() from link 1.

vc = df[lst].value_counts()
vc_index = vc.index[vc.gt(1)]   # keep only combinations occurring more than once
df = df[df[lst].isin(vc_index)]

I'm getting the value counts vc and then building a MultiIndex vc_index that contains only the wanted index entries (those with a count > 1). After that, I tried filtering my df with .isin() like in link 1, but this set all values of df to NaN/NaT. I'm stuck here - I'm not sure what I did wrong.

df

index      t1      num1    float1  ...    str2  
0         NaT       NaN       NaN  ...     NaN             
1         NaT       NaN       NaN  ...     NaN             
...       ...       ...       ...  ...     ...           
n-1       NaT       NaN       NaN  ...     NaN             
n         NaT       NaN       NaN  ...     NaN             
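In pseudo-terms, what I was trying to achieve is a row-wise membership test, something like the sketch below (I haven't verified it on the real data; DataFrame.value_counts needs pandas >= 1.1, and row_key/keep are made-up names):

vc = df[lst].value_counts()
keep = vc.index[vc.gt(1)]                    # combinations occurring more than once
row_key = pd.MultiIndex.from_frame(df[lst])  # one MultiIndex entry per row of df
df = df[row_key.isin(keep)]                  # compare whole rows instead of single cells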

In another attempt, I tried using the pd.Index.difference() method

vc = df[lst].value_counts()
df = df.set_index(keys=lst)
df.index = df.index.difference(other=vc.index[vc.gt(1)])

but this only gave me a TypeError: '<' not supported between instances of 'float' and 'str'.

Honestly, I'm kind of clueless about what's best here; I've been sitting on this problem for two days. I even thought about parallelization (maybe with Dask?), but I'm unsure how to use that since I've never worked with it. Help would be much appreciated.

Upvotes: 0

Views: 1204

Answers (2)

CptSnuggles

Reputation: 137

The solution from @Quang Hoang works very well. Some benchmarks I made with my dataset:

(rs = rows of df, ngrps = df.groupby(lst).ngroups)

method   100k rs/82,488 ngrps  200k rs/164,466 ngrps  400k rs/331,351 ngrps  800k rs/672,905 ngrps  1,600k rs/1,351,525 ngrps

duplicated     0:00:00.031236         0:00:00.078112         0:00:00.181825         0:00:00.331095             0:00:00.683959
transform      0:00:00.062507         0:00:00.109386         0:00:00.261506         0:00:00.528166             0:00:01.029606
filter         0:00:09.039214         0:00:18.422355         0:00:37.372117         0:01:15.531945             0:02:32.075144

It scales pretty well using duplicated, but be aware: if there are NaN values in the columns you group on (the ones in lst in my example), duplicated will not drop those rows.
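A minimal toy example of that caveat (made-up data, just to show the difference in behaviour):

import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1, 1, np.nan, np.nan], "b": ["x", "x", "y", "y"]})

# duplicated treats the NaN keys as equal, so the (NaN, 'y') rows are kept ...
print(toy[toy.duplicated(["a", "b"], keep=False)])

# ... whereas groupby drops NaN keys by default, so filter removes those rows entirely
print(toy.groupby(["a", "b"]).filter(lambda g: len(g) > 1))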

Upvotes: 0

Quang Hoang

Reputation: 150785

For this specific use case (groups with count > 1), duplicated is much faster:

df[df.duplicated(lst, keep=False)]
# 231 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Another option, not quite as fast but still significantly faster than filter and more generally applicable, is groupby().transform('size'):

df[df.groupby(lst)['t1'].transform('size')>1]
# 554 ms ± 108 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
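Side note (not from the original answer): the 't1' column is only there to give transform a single column to work on; since 'size' counts rows regardless of their values, any of the grouping columns would do, e.g.:

df[df.groupby(lst)[lst[0]].transform('size') > 1]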

Compared to:

df.groupby(lst).filter(lambda x: len(x) > 1)
# CPU times: user 38.8 s, sys: 482 ms, total: 39.3 s
# Wall time: 39.4 s

Upvotes: 3
