Reputation: 1190
Scenario
I have a dataframe with a given structure,and to sum it up at the end I want to find the time difference between the responses and requests of a service. It has the following columns:
And an example of data would be:
Timestamp Service Command Message_Type Message_ID
12:00:00 FoodOrders SeeStock() Request 125
12:00:02 FoodOrders SeeStock() Response 125
The output would have to be something like
Service Command Message_ID TimeDiff
FoodOrders SeeStock 125 00:00:02
What have I thought of doing
Grouping by Service,Command,Message_ID and add an additional column with some function that calculates the difference of time.
My actual questions
Thanks.
Upvotes: 1
Views: 282
Reputation: 863751
If performance is important, avoid aggregation and groupby, because slow, better is create Response
and Response
Series with MultiIndex and subtract Timestamp
s, sort_index
should also help with performance:
#if necessary
#df['Timestamp'] = pd.to_timedelta(df['Timestamp'])
cols = ['Service','Command','Message_ID']
s1 = df[df['Message_Type'] == 'Response'].set_index(cols)['Timestamp'].sort_index()
s2 = df[df['Message_Type'] == 'Request'].set_index(cols)['Timestamp'].sort_index()
df1 = s1.sub(s2).reset_index()
print (df1)
Service Command Message_ID Timestamp
0 FoodOrders SeeStock() 125 00:00:02
Upvotes: 1
Reputation: 1433
If you use jupiter notebook, you can try something like this:
%timeit df.sort_values('Time').groupby(['Service', 'Command', 'Message_Type', 'Message_ID']).apply(lambda x: x.iloc[1]['Time'] - x.iloc[0]['Time'])
In my sample I have this out:
2.97 ms ± 310 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And I also think that it's a good plan = )
Upvotes: 1
Reputation: 6280
followed by this code from another post:
import time
start = time.time()
print("hello")
end = time.time()
print(end - start)
you can measure the time on your own.
Try oyur approach and the lambda to test it.
Upvotes: 1
Reputation: 4268
The plan is more or less OK. Note that for efficiency it would be better not to pass a lambda function directly to calculate a custom aggregation like TimeDiff. It is better to first calculate auxiliary aggregations that can be done with pandas / numpy built-ins and then compute your custom aggregation based on those.
Upvotes: 1