Lion_YY

Reputation: 25

How to remove duplicate rows in pandas with multiple conditions

import pandas as pd

df = pd.DataFrame(
    [
        ['China', 'L', '08/06/2022 20:00', '08/10/2022 20:00'],
        ['China', 'L', '8/13/2022 00:54', '8/14/2022 00:54'],
        ['China', 'M', '8/14/2022 00:54', '8/14/2022 12:54'],
        ['United Kingdom', 'L', '8/27/2022 06:36', '8/31/2022 21:08'],
        ['United Kingdom', 'L', '9/01/2022 21:08', '09/02/2022 21:38'],
        ['China', 'D', '09/04/2022 21:38', '09/06/2022 21:38']
    ],
    columns=['Country', 'Function', 'Arrival', 'Departure']
)

In this case, I want to collapse consecutive duplicate rows (same 'Country' and 'Function') into a single row and replace its departure time with the value from the last duplicate, subject to two conditions:

  1. Do not remove duplicates that are not consecutive.
  2. If the 'Function' column changes, do not treat the row as a duplicate, even if it is consecutive.

So it should look like this:

df = pd.DataFrame(
    [
        ['China', 'L', '08/06/2022 20:00', '8/14/2022 00:54'],
        ['China', 'M', '8/14/2022 00:54', '8/14/2022 12:54'],
        ['United Kingdom', 'L', '8/27/2022 06:36', '09/02/2022 21:38'],
        ['China', 'D', '09/04/2022 21:38', '09/06/2022 21:38']
    ],
    columns=['Country', 'Function', 'Arrival', 'Departure']
)

Upvotes: 0

Views: 145

Answers (1)

mozway

Reputation: 260690

You can use groupby.idxmax to keep, for each ('Country', 'Function') pair, the row with the latest 'Departure':

idx = (pd.to_datetime(df['Departure'])
         .groupby([df['Country'], df['Function']], sort=False)
         .idxmax()
       )

out = df.loc[idx]

output:

          Country Function           Arrival         Departure
1           China        L   8/13/2022 00:54   8/14/2022 00:54
2           China        M   8/14/2022 00:54   8/14/2022 12:54
4  United Kingdom        L   9/01/2022 21:08  09/02/2022 21:38
5           China        D  09/04/2022 21:38  09/06/2022 21:38
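
Note that the idxmax approach keeps the whole row with the latest departure, so 'Arrival' comes from the last row of each pair rather than the first, and rows are grouped by value rather than by consecutive run. Below is a minimal sketch of an alternative, assuming "consistent" in the question means consecutive rows. The names `key` and `run_id` are illustrative and not part of the original post. It starts a new run whenever 'Country' or 'Function' differs from the previous row, then takes the first 'Arrival' and the last 'Departure' of each run:

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['China', 'L', '08/06/2022 20:00', '08/10/2022 20:00'],
        ['China', 'L', '8/13/2022 00:54', '8/14/2022 00:54'],
        ['China', 'M', '8/14/2022 00:54', '8/14/2022 12:54'],
        ['United Kingdom', 'L', '8/27/2022 06:36', '8/31/2022 21:08'],
        ['United Kingdom', 'L', '9/01/2022 21:08', '09/02/2022 21:38'],
        ['China', 'D', '09/04/2022 21:38', '09/06/2022 21:38']
    ],
    columns=['Country', 'Function', 'Arrival', 'Departure']
)

# A row starts a new run when Country or Function differs from the row above.
# The first row is compared against NaN, so it always starts a run.
key = df[['Country', 'Function']]
run_id = key.ne(key.shift()).any(axis=1).cumsum()

# Collapse each run: keep the first Arrival and the last Departure (in row order).
out = (
    df.groupby(run_id, sort=False)
      .agg({'Country': 'first', 'Function': 'first',
            'Arrival': 'first', 'Departure': 'last'})
      .reset_index(drop=True)
)

print(out)
```

Because the run id only changes on a difference from the previous row, a later non-consecutive ('China', 'L') row would get its own run and be kept separately. This sketch takes the last 'Departure' in row order; if the rows are not already sorted by time, sort them first or aggregate the parsed datetimes with 'max' instead.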

Upvotes: 1
