Reputation: 45752
Given a DataFrame similar to this (but with over a million rows and about 140,000 different groups):
import pandas as pd

df_test = pd.DataFrame({'group': {1: 'A', 2: 'A', 3: 'A', 4: 'A', 5: 'B', 6: 'B'},
                        'time':  {1: 1, 2: 3, 3: 5, 4: 23, 5: 7, 6: 12}})
for each group I want to find the difference between the time (which is actually a dtype('<M8[ns]') in my real df) and the minimum time for that group.
I have managed it using groupby and transform as follows:
df_test['time_since'] = df_test.groupby('group')['time'].transform(lambda d: d - d.min())
which correctly produces:
group time time_since
1 A 1 0
2 A 3 2
3 A 5 4
4 A 23 22
5 B 7 0
6 B 12 5
but it takes almost a minute to compute. Is there a faster / smarter way to do this?
Upvotes: 1
Views: 779
Reputation: 323226
My suggestion: do the calculation outside the transform, so we do not need a lambda. With the lambda, the subtraction is executed once per group, so its cost grows with the number of groups:
df_test=pd.concat([df_test]*1000)
%timeit df_test['time']-df_test.groupby('group')['time'].transform(min)
1000 loops, best of 3: 1.11 ms per loop
%timeit df_test.groupby('group')['time'].transform(lambda d: d - d.min())
The slowest run took 7.20 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.3 ms per loop
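Another variation along the same lines (a sketch, not something the answer above benchmarked): compute the per-group minimum once with groupby().min() and broadcast it back onto the rows with Series.map, which likewise avoids calling a Python function per group:

```python
import pandas as pd

df_test = pd.DataFrame({'group': {1: 'A', 2: 'A', 3: 'A', 4: 'A', 5: 'B', 6: 'B'},
                        'time':  {1: 1, 2: 3, 3: 5, 4: 23, 5: 7, 6: 12}})

# One aggregation pass: Series indexed by group label -> minimum time.
group_min = df_test.groupby('group')['time'].min()

# Broadcast the per-group minimum back to each row, then subtract vectorized.
df_test['time_since'] = df_test['time'] - df_test['group'].map(group_min)
```

This produces the same time_since column (0, 2, 4, 22 for group A; 0, 5 for group B) and, like the transform(min) version, works unchanged when 'time' is a datetime64[ns] column, where the subtraction yields timedeltas.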
Upvotes: 3