Reputation: 2005
I am working on log analysis and I have managed to get the information into a DataFrame like this:
job    id  task    start_timestamp   end_timestamp     log_level
-----------------------------------------------------------------
job1   1   task_1  06/02/2020 14:00  NaN               INFO
job1   1   task_2  06/02/2020 14:00  NaN               INFO
job1   1   task_3  06/02/2020 14:00  NaN               INFO
job1   1   task_2  NaN               06/02/2020 23:00  INFO
job1   1   task_3  NaN               06/02/2020 23:00  INFO
job2   2   task_1  06/02/2020 14:00  NaN               INFO
job2   2   task_1  NaN               06/02/2020 23:00  INFO
I am not sure what kind of operation I should perform on the DataFrame to get to this:
job    id  task    start_timestamp   end_timestamp     log_level
-----------------------------------------------------------------
job1   1   task_1  06/02/2020 14:00  NaN               INFO
job1   1   task_2  06/02/2020 14:00  06/02/2020 23:00  INFO
job1   1   task_3  06/02/2020 14:00  06/02/2020 23:00  INFO
job2   2   task_1  06/02/2020 14:00  06/02/2020 23:00  INFO
I have tried a groupby operation like this:
df_result = df_log.groupby(['job','id','task'])[['start_timestamp', 'end_timestamp','log_level']].sum().reset_index()
This more or less works for start_timestamp and end_timestamp, but not for log_level: summing string columns concatenates them, so the merged rows end up with values like INFOINFO.
I have also tried an aggregation:
df_result = df_log.groupby(['job','id','task']).agg({'start_timestamp': 'sum',
                                                     'end_timestamp': 'sum',
                                                     'log_level': 'sum'}).reset_index()
Is there a better way?
Upvotes: 0
Views: 44
Reputation: 3663
You can use first(), which returns the first non-null value per column within each group, so the "start" row and the "end" row of a task collapse into a single record:
df = df.groupby(['job','id','task'])[['start_timestamp', 'end_timestamp','log_level']].first().reset_index()
print(df)
Output:
job id task start_timestamp end_timestamp log_level
0 job1 1 task_1 06/02/2020 14:00 NaN INFO
1 job1 1 task_2 06/02/2020 14:00 06/02/2020 23:00 INFO
2 job1 1 task_3 06/02/2020 14:00 06/02/2020 23:00 INFO
3 job2 2 task_1 06/02/2020 14:00 06/02/2020 23:00 INFO
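A minimal reproducible sketch of the above; the DataFrame is reconstructed from the question's tables, with the timestamps kept as plain strings:

```python
import numpy as np
import pandas as pd

df_log = pd.DataFrame({
    'job':  ['job1', 'job1', 'job1', 'job1', 'job1', 'job2', 'job2'],
    'id':   [1, 1, 1, 1, 1, 2, 2],
    'task': ['task_1', 'task_2', 'task_3', 'task_2', 'task_3', 'task_1', 'task_1'],
    'start_timestamp': ['06/02/2020 14:00', '06/02/2020 14:00', '06/02/2020 14:00',
                        np.nan, np.nan, '06/02/2020 14:00', np.nan],
    'end_timestamp':   [np.nan, np.nan, np.nan, '06/02/2020 23:00',
                        '06/02/2020 23:00', np.nan, '06/02/2020 23:00'],
    'log_level': ['INFO'] * 7,
})

# first() picks the first non-null value per column in each group, so the
# row carrying start_timestamp and the row carrying end_timestamp merge.
df = df_log.groupby(['job', 'id', 'task'])[
    ['start_timestamp', 'end_timestamp', 'log_level']
].first().reset_index()
print(df)
```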
Upvotes: 2
Reputation: 323326
I would try
df = df_log.groupby(['job','id','task']).first().reset_index()
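Since first() skips nulls per column, leaving out the column selection applies it to every non-grouping column with the same effect; a quick sketch using two rows assumed from the question's job2 data:

```python
import numpy as np
import pandas as pd

# Two rows for the same task: one carries the start, the other the end.
df_log = pd.DataFrame({
    'job': ['job2', 'job2'],
    'id': [2, 2],
    'task': ['task_1', 'task_1'],
    'start_timestamp': ['06/02/2020 14:00', np.nan],
    'end_timestamp':   [np.nan, '06/02/2020 23:00'],
    'log_level': ['INFO', 'INFO'],
})

# Without a column selection, first() aggregates all non-key columns.
df = df_log.groupby(['job', 'id', 'task']).first().reset_index()
print(df)
```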
Upvotes: 2