Python DataFrame: Remove duplicates based on condition?

Question

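I have a DataFrame df with columns name and subject. For each user, I want to remove duplicate math rows within each block that starts at a first row, keeping only the first math row of that block. Only math should be de-duplicated: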

            name        subject
    0      mason          first
    1      mason          math
    2      mason          math
    3      mason          first 
    4      mason          chem
    5      mason          math
    6      mason          math
    7       paul          first
    8       paul          chem
    9       paul          first
    10      paul          math
    11      paul          math

Final df

            name        subject
    0      mason          first
    1      mason          math
    2      mason          first 
    3      mason          chem
    4      mason          math
    5       paul          first
    6       paul          chem
    7       paul          first
    8       paul          math

anky · Accepted Answer

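One way is to build a grouper from a cumulative sum that increments at every "first" row, then use df.groupby(...).apply to compute a boolean mask within each (name, block) group. The mask keeps every non-math row, and keeps a math row only if it is the first one in its group: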

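# Block id that increments at every "first" row
c1 = df['subject'].eq("first").cumsum()
# group_keys=False keeps the original index so the mask aligns with df
mask = (df.groupby(["name", c1], group_keys=False)['subject']
          .apply(lambda x: (~x.duplicated() & x.eq("math")) | x.ne("math")))
out = df[mask]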

print(out)

     name subject
0   mason   first
1   mason    math
3   mason   first
4   mason    chem
5   mason    math
7    paul   first
8    paul    chem
9    paul   first
10   paul    math
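
For reference, here is a self-contained sketch of the same idea without groupby.apply. It rebuilds the sample data from the question (assuming the subject values are clean strings with no trailing whitespace) and marks repeats with DataFrame.duplicated over name, block id and subject. The names block, is_math, repeat and reset are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data from the question (assumed to have no trailing whitespace)
df = pd.DataFrame({
    "name": ["mason"] * 7 + ["paul"] * 5,
    "subject": ["first", "math", "math", "first", "chem", "math", "math",
                "first", "chem", "first", "math", "math"],
})

# Block id: increments at every "first" row, so each block starts at a "first"
block = df["subject"].eq("first").cumsum()

# True for rows whose subject is "math"
is_math = df["subject"].eq("math")

# True for rows repeating an earlier (name, block, subject) combination
repeat = df.assign(block=block).duplicated(["name", "block", "subject"])

# Drop only the repeated "math" rows; every other row is kept
out = df[~(is_math & repeat)]

# Optional: renumber the index from 0, as in the question's "Final df"
reset = out.reset_index(drop=True)
print(reset)
```

Because duplicated is combined with is_math, repeated values of any other subject inside a block would be left alone, which matches the "only math" requirement.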
