Reputation: 137
I have a huge DataFrame with millions of rows and about 20-30 columns holding various types of data, e.g. strings, numbers, dates, etc.
df
index t1 num1 float1 ... str2
0 2014-10-21 3456 0.000 ... ayzkcxtoScUy
1 2014-10-21 2453 0.000 ... jZygJWtxyVnS
... ... ... ... ... ...
n-1 2020-11-06 708735 670.818 ... UWVhmKCfmzVj
n 2020-11-06 70630 670.817 ... EvhreYZotqVS
Let's just say it's pretty wild, but I need every row with all of its values. I now want to group on certain columns and eliminate groups, i.e. their rows, from the original DataFrame df based on the group size. In particular, I want to eliminate all groups of size 1.
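For reference, here is a small synthetic stand-in with the same column layout that the snippets below can be run against (the values, sizes and exact dtypes are just placeholders, not my real data):

import numpy as np
import pandas as pd

# toy stand-in for the real frame: dates, ints, floats and random-looking strings
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "t1": rng.choice(pd.date_range("2014-10-21", "2020-11-06"), n),
    "num1": rng.integers(0, 1_000_000, n),
    "float1": rng.random(n) * 1000,
    "str1": rng.choice(list("ABCDE"), n),
    "num2": rng.integers(0, 10, n),
    "str2": rng.choice(["ayzkcxtoScUy", "jZygJWtxyVnS", "UWVhmKCfmzVj", "EvhreYZotqVS"], n),
})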
FIRST NAIVE APPROACH
I searched and tried using this answer: How to select rows in Pandas dataframe where value appears more than once
lst = ["t1", "str1", "num1", "str2", "num2"]
df = df.groupby(lst).filter(lambda x: len(x.index) > 1).reset_index(drop=True)
which indeed works as expected: my DataFrame df is now stripped of all rows that occurred in groups of size 1. The problem is the runtime in relation to the dimensions of my DataFrame; the filter method just takes too long. To put this into perspective, grouping on these sample columns yields about 165,000 groups on a 2.5-million-row DataFrame, and roughly a third of these groups have a size of 1. I had to abort the execution of the script because it would have taken ages. I also tried drawing on this link How do I improve the performance of pandas GroupBy filter operation? but couldn't get it working with map, because I'm grouping on a DataFrame and not on a Series. With the transform method, performance got even worse.
SIDENOTE
Investigating further, I found out that there has been a performance problem when using filter on a DataFrame that has datetime64[ns, UTC] and/or datetime64[ns] columns. I used del df[x] to delete all three of these columns, which cut the runtime of the filter method by about a third. That is still not enough, but significant enough to mention here, especially since I actually need these columns and can't just delete them.
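For completeness, here is a rough sketch of that workaround in which the datetime columns are only dropped for the filter step and then re-attached via the surviving row labels (dt_cols, slim and kept are just names for this sketch; it assumes a unique row index and leaves the grouping keys untouched):

# drop the datetime-typed columns (except any grouping keys) only for the filter step
dt_cols = df.select_dtypes(include=["datetime64[ns]", "datetime64[ns, UTC]"]).columns.difference(lst)
slim = df.drop(columns=dt_cols)
kept = slim.groupby(lst).filter(lambda x: len(x.index) > 1)
# re-attach the dropped columns by selecting the surviving rows from the original frame
df = df.loc[kept.index].reset_index(drop=True)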
SECOND 'CLEVER' APPROACH
I then tried to use clever indexing of my data to circumvent groupby, filter and transform altogether by using .value_counts() from link 1.
vc = df[lst].value_counts()
vc_index = vc.index[vc.gt(1)]
df = df[df[lst].isin(vc_index)]
I get the value counts vc, then build a MultiIndex vc_index containing only the wanted entries (those with count > 1). After that, I tried filtering my df with .isin() like in link 1, but this set all values of df to NaN/NaT. I'm stuck here - I'm not sure what I did wrong.
df
index t1 num1 float1 ... str2
0 NaT NaN NaN ... NaN
1 NaT NaN NaN ... NaN
... ... ... ... ... ...
n-1 NaT NaN NaN ... NaN
n NaT NaN NaN ... NaN
In another attempt, I tried using the pd.Index.difference() method:
vc = df[lst].value_counts()
df = df.set_index(keys=lst)
df.index = df.index.difference(other=vc.index[vc.gt(1)])
but this only gave me a TypeError: '<' not supported between instances of 'float' and 'str'.
Honestly, I'm kind of clueless about what's best here; I've been sitting on this problem for two days. I even thought about parallelization (maybe with Dask?), but I'm unsure how to use that since I've never worked with it. Help would be much appreciated.
Upvotes: 0
Views: 1204
Reputation: 137
The solution from @Quang Hoang works very well. Here are some benchmarks I made with my dataset (rs = rows of df, ngrps = df.groupby(lst).ngroups):
method       100k rs / 82,488 ngrps    200k rs / 164,466 ngrps   400k rs / 331,351 ngrps   800k rs / 672,905 ngrps   1,600k rs / 1,351,525 ngrps
duplicated   0:00:00.031236            0:00:00.078112            0:00:00.181825            0:00:00.331095            0:00:00.683959
transform    0:00:00.062507            0:00:00.109386            0:00:00.261506            0:00:00.528166            0:00:01.029606
filter       0:00:09.039214            0:00:18.422355            0:00:37.372117            0:01:15.531945            0:02:32.075144
The duplicated approach scales very well, but be aware: if there are NaN values in the columns you group on (lst in my example), duplicated will not drop those rows, while the groupby-based variants do.
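A tiny toy example of that caveat (made-up data, not from my set): the NaN-keyed rows survive duplicated, but are dropped by the groupby-based variants because groupby skips NaN keys by default.

import numpy as np
import pandas as pd

toy = pd.DataFrame({"key": ["a", "a", np.nan, np.nan], "val": [1, 2, 3, 4]})
print(toy[toy.duplicated(["key"], keep=False)])                # 4 rows, the NaN-keyed rows are kept
print(toy[toy.groupby(["key"])["val"].transform("size") > 1])  # 2 rows, the NaN-keyed rows are dropped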
Upvotes: 0
Reputation: 150785
For this specific use case (group with count > 1), duplicated
is much faster:
df[df.duplicated(lst, keep=False)]
# 231 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
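If you also want the fresh 0..n-1 index from your filter version, just tack a reset on at the end, e.g.:

df[df.duplicated(lst, keep=False)].reset_index(drop=True)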
Another option, not quite as fast but significantly faster than filter, and one that works in general, is groupby().transform('size')
:
df[df.groupby(lst)['t1'].transform('size')>1]
# 554 ms ± 108 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Compared to:
df.groupby(lst).filter(lambda x: len(x) > 1)
# CPU times: user 38.8 s, sys: 482 ms, total: 39.3 s
# Wall time: 39.4 s
Upvotes: 3