jalazbe

Reputation: 2005

How to combine rows on a pandas DataFrame based on coincidences with other rows

I am working on log analysis and I have managed to get the information into this shape:

job    id   task        start_timestamp    end_timestamp     log_level 
-----------------------------------------------------------------------
job1    1   task_1       06/02/2020 14:00                     INFO
job1    1   task_2       06/02/2020 14:00                     INFO
job1    1   task_3       06/02/2020 14:00                     INFO
job1    1   task_2                         06/02/2020 23:00   INFO
job1    1   task_3                         06/02/2020 23:00   INFO
job2    2   task_1       06/02/2020 14:00                     INFO
job2    2   task_1                         06/02/2020 23:00   INFO

I am not sure what kind of operation I should perform on the DataFrame to get to this:

job    id   task     start_timestamp       end_timestamp      log_level
-----------------------------------------------------------------------
job1    1   task_1       06/02/2020 14:00                     INFO
job1    1   task_2       06/02/2020 14:00   06/02/2020 23:00  INFO
job1    1   task_3       06/02/2020 14:00   06/02/2020 23:00  INFO
job2    2   task_1       06/02/2020 14:00   06/02/2020 23:00  INFO

I am trying a groupby operation like this:

df_result = df_log.groupby(['job','id','task'])[['start_timestamp', 'end_timestamp','log_level']].sum().reset_index()

This more or less works for start_timestamp and end_timestamp, but not for log_level.

I have also tried an aggregation:

df_result = df_log.groupby(['job','id','task']).agg({'start_timestamp':'sum',
                                                     'end_timestamp':'sum',
                                                     'log_level':'sum'}).reset_index()

Is there a better way?

Upvotes: 0

Views: 44

Answers (2)

Soumendra Mishra

Reputation: 3663

You can try this:

df = df.groupby(['job','id','task'])[['start_timestamp', 'end_timestamp','log_level']].first().reset_index()
print(df)

Output:

    job  id    task   start_timestamp     end_timestamp log_level
0  job1   1  task_1  06/02/2020 14:00               NaN      INFO
1  job1   1  task_2  06/02/2020 14:00  06/02/2020 23:00      INFO
2  job1   1  task_3  06/02/2020 14:00  06/02/2020 23:00      INFO
3  job2   2  task_1  06/02/2020 14:00  06/02/2020 23:00      INFO
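For reference, here is a self-contained sketch of the same approach, with the sample data from the question rebuilt by hand (the column values are copied from the tables above; `None` stands for the blank timestamps):

```python
import pandas as pd

# Reconstruct the log data from the question; None marks a missing timestamp.
df_log = pd.DataFrame({
    'job':  ['job1', 'job1', 'job1', 'job1', 'job1', 'job2', 'job2'],
    'id':   [1, 1, 1, 1, 1, 2, 2],
    'task': ['task_1', 'task_2', 'task_3', 'task_2', 'task_3', 'task_1', 'task_1'],
    'start_timestamp': ['06/02/2020 14:00', '06/02/2020 14:00', '06/02/2020 14:00',
                        None, None, '06/02/2020 14:00', None],
    'end_timestamp':   [None, None, None,
                        '06/02/2020 23:00', '06/02/2020 23:00', None, '06/02/2020 23:00'],
    'log_level': ['INFO'] * 7,
})

# GroupBy.first() returns the first non-null value per column in each group,
# so a task's "start" row and "end" row collapse into a single combined row.
df = df_log.groupby(['job', 'id', 'task'])[
    ['start_timestamp', 'end_timestamp', 'log_level']
].first().reset_index()
print(df)
```

This works precisely because `first()` skips nulls: within each `(job, id, task)` group, `start_timestamp` is taken from the row that has it and `end_timestamp` from the other. By contrast, `sum()` on object columns concatenates strings, which is why it only "more or less" worked in the question.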

Upvotes: 2

BENY

Reputation: 323326

I would try:

df = df_log.groupby(['job','id','task']).first().reset_index()

Upvotes: 2
