Reputation: 5197
I have a dataframe df
. (cfg, x, rounds)
are unique, the rest individually are not.
cfg x rounds score rewards
0 f63c2c a 1 0.01 10
1 f63c2c a 2 0.02 15
2 f63c2c b 3 0.03 30
3 f63c2c b 4 0.04 13
4 f63c2c b 5 0.05 8
5 37fb26 a 1 0.08 8
6 35442a a 5 0.19 8
7 bb8460 b 2 0.05 9
I want to filter the dataframe in such a way, that only the cfg, x, max(rounds)
rows are in the result, i.e.
cfg x rounds score rewards
1 f63c2c a 2 0.02 15
4 f63c2c b 5 0.05 8
5 37fb26 a 1 0.08 8
6 35442a a 5 0.19 8
7 bb8460 b 2 0.05 9
For this I identify the maximum using:
gf = df.groupby(["cfg", "x"]).max().loc[:,["rounds"]]
However, i have not figured out a way to filter on df using gf as a predicate provider. Any ideas?
Upvotes: 2
Views: 2192
Reputation: 3855
The solution is not with groupby (or more accurately, the easiest solution is not with gorupby), but with drop_duplicates
.
By default drop_duplicates keeps the first row of any duplicate value, therfore you can sort your dataframe and then drop the duplicates with the following:
gf = df.sort_values(by = 'rounds',ascending = [True,False]).\
drop_duplicates(subset = ['cfg','x'])
cfg x rounds score rewards
6 35442a a 5 0.19 8
5 37fb26 a 1 0.08 8
7 bb8460 b 2 0.05 9
4 f63c2c b 5 0.05 8
You can also equivalently do:
gf = df.sort_values(by = 'rounds',ascending = True).\
drop_duplicates(subset = ['cfg','x'],keep = 'last')
EDIT: Timeit
Surprisingly I don't get the same timings as coldspeed in his answer:
df_test = pd.concat([df] * 100000)
%timeit df_test.sort_values(by = ['cfg','rounds'],ascending = True).\
drop_duplicates(subset = ['cfg'],keep = 'last')
%timeit df_test.groupby('cfg').rounds.apply(np.max).reset_index().\
merge(my_df2, on=['cfg', 'rounds'])
62 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
70.6 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This doesn't appear to depend on the number of cores (I've launched it on 8 and 12 cores yielding the same ranking), nor on the size of the dataframe (I've tried with df_test
10 000, 100 000, and 1 000 000 the size of df, and the ranking remains).
So I guess it must depend on your hardware, you will just have to try both methods and see what fits your computer.
Thanks coldspeed for pointing that out
Upvotes: 1
Reputation: 402263
It is indeed possible using df.groupby
and df.merge
:
n [231]: df.groupby(['cfg', 'x']).rounds\
...: .apply(np.max).reset_index()\
...: .merge(df, on=['cfg', 'x', 'rounds'])
Out[231]:
cfg x rounds score rewards
0 35442a a 5 0.19 8
1 37fb26 a 1 0.08 8
2 bb8460 b 2 0.05 9
3 f63c2c a 2 0.02 15
4 f63c2c b 5 0.05 8
And, using df.sort_values
:
In [237]: df.sort_values(by = ['cfg','x', 'rounds'],ascending = [True, True, False])\
.drop_duplicates(subset = ['cfg', 'x'])
Out[237]:
cfg x rounds score rewards
6 35442a a 5 0.19 8
5 37fb26 a 1 0.08 8
7 bb8460 b 2 0.05 9
1 f63c2c a 2 0.02 15
4 f63c2c b 5 0.05 8
Performance
df_test = pd.concat([df] * 100000) # Setup
Using df.merge
:
%timeit df_test.sort_values(by = ['cfg','x', 'rounds'],ascending = [True, True, False])
.drop_duplicates(subset = ['cfg', 'x'])
1 loop, best of 3: 229 ms per loop
Using df.sort_values
and df.drop_duplicates
:
%timeit df_test.groupby(['cfg', 'x']).rounds\
.apply(np.max).reset_index()\
.merge(df, on=['cfg', 'x', 'rounds'])
10 loops, best of 3: 129 ms ms per loop
Upvotes: 3