Reputation: 103
I have a canonical Pandas transform example in which performance seems inexplicably slow. I have read the Q&A on the apply method, which is related but, in my humble opinion, offers an incomplete and potentially misleading answer to my question, as I explain below.
The first five rows of my dataframe are:

     id        date      xvar
0  1004  1992-05-31  4.151628
1  1004  1993-05-31  2.868015
2  1004  1994-05-31  3.043287
3  1004  1995-05-31  3.189541
4  1004  1996-05-31  4.008760
There are 2,992 unique id values. I want to center xvar by id.
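In case it helps with reproduction, a synthetic frame with the same shape as my data (2,992 ids with 28 annual observations each) can be built along these lines; the values are random, not my actual data:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
ids = np.arange(1004, 1004 + 2992)
dates = pd.date_range('1992-05-31', periods=28, freq=pd.DateOffset(years=1))
df_r = pd.DataFrame({
    'id': np.repeat(ids, len(dates)),      # each id repeated once per date
    'date': np.tile(dates, len(ids)),      # the same 28 dates for every id
    'xvar': rng.normal(3.5, 1.0, len(ids) * len(dates)),
})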
Approach 1 takes 861 ms:
# Center xvar within each id using a Python-level lambda (runs once per group)
df_r['xvar_center'] = (
    df_r
    .groupby('id')['xvar']
    .transform(lambda x: x - x.mean())
)
Approach 2 takes 9 ms:
# Group means
df_r_mean = (
    df_r
    .groupby('id', as_index=False)['xvar']
    .mean()
    .rename(columns={'xvar': 'xvar_avg'})
)
# Merge group means onto dataframe and center
df_w = (
    pd
    .merge(df_r, df_r_mean, on='id', how='left')
    .assign(xvar_center=lambda x: x.xvar - x.xvar_avg)
)
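For what it's worth, the two approaches agree; the left merge preserves df_r's row order, so a direct comparison works:

import numpy as np
# Left merge keeps df_r's row order, so the two results align row by row
assert np.allclose(df_r['xvar_center'], df_w['xvar_center'])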
The Q&A on the apply method recommends relying on vectorized functions whenever possible, much like @sammywemmy's comment implies; I see that as the overlap between the two questions. However, the Q&A on the apply method also states:
"...here are some common situations where you will want to get rid of any calls to apply ... Numeric Data"
@sammywemmy's comment does not "get rid of any calls to" the transform method in answering my question. On the contrary, the answer relies on transform. Therefore, unless @sammywemmy's suggestion is strictly dominated by an alternative that avoids transform entirely, I think my question and its answer are sufficiently distinct from the discussion in the Q&A on the apply method. (Thank you for your patience and help.)
Upvotes: 4
Views: 2853
Reputation: 103
This answer is due to the insightful comment from @sammywemmy, who deserves all credit and no blame for any inaccuracy here. Because a similar use of transform is illustrated in the Pandas User's Guide, I thought elaborating might be useful for others.
My hypothesis is that the problem stems from the combination of a non-vectorized function and a large number of groups. When I change the groupby variable from id (2,992 unique values) to year (constructed from the date variable and containing 28 unique values), the performance difference between my original approach and @sammywemmy's narrows substantially but remains significant.
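For reference, year can be constructed from date along these lines:

# Pull the calendar year out of date (parsing first in case it is a string);
# this yields the 28 unique values used as the grouping variable below
df_r['year'] = pd.to_datetime(df_r['date']).dt.year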
%%timeit
df_r['xvar_center_y'] = (
    df_r
    .groupby('year')['xvar']
    .transform(lambda x: x - x.mean())
)
11.4 ms ± 202 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
vs.
%timeit df_r['xvar_center_y'] = df_r.xvar - df_r.groupby('year')['xvar'].transform('mean')
1.69 ms ± 5.11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The beauty of @sammywemmy's insight is that it is easy to apply to other common transformations, yielding potentially significant performance improvements at a modest cost in additional code. For example, consider standardizing a variable:
%%timeit
df_r['xvar_z'] = (
    df_r
    .groupby('id')['xvar']
    .transform(lambda x: (x - x.mean()) / x.std())
)
1.34 s ± 38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
vs.
%%timeit
df_r['xvar_z'] = (
    (df_r.xvar - df_r.groupby('id')['xvar'].transform('mean'))
    / df_r.groupby('id')['xvar'].transform('std')
)
3.96 ms ± 297 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
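The same pattern extends to other group-wise transformations. For instance, min-max scaling within id (a sketch, untimed):

# Reuse one groupby object; 'min' and 'max' dispatch to fast built-in
# aggregations, and the scaling itself is a single vectorized expression
grp = df_r.groupby('id')['xvar']
df_r['xvar_minmax'] = (
    (df_r.xvar - grp.transform('min'))
    / (grp.transform('max') - grp.transform('min'))
)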
Upvotes: 3