Lilly
Lilly

Reputation: 988

Special remove duplicates from dataframe

I am trying to remove duplicate rows but with some special condition. Input data is as below: I want to remove duplicate rows. I have to compare all the columns and hence I have to sort the dataframe based on CODE and hence I can compare the rows. If cell values are same (Have to ignore the upper and lower case while comparing the column CITY) and if the COMPANY column has both BELL and MAS, then I have to choose the with BELL. Thanks.

NAME    CODE   STATE   CITY       COMPANY
abc     109    TN      Trichy     BELL
abc     109    TN      Salem      MAS
pen     9      TN      Mdu        BELL
pen     9      TN      MDU        MAS
hat     10     TN      Karur      MAS
mat     89     TN      Hosur      MAS
mat     89     TN      Hosur      BELL
paper   77     TN      Nyl        BELL
stone   98     TN      Nyl        BELL
stone   98     TN      Nyl        MAS

Results expected:

NAME    CODE   STATE   CITY       COMPANY
abc     109    TN      Trichy     BELL
abc     109    TN      Salem      MAS
pen     9      TN      Mdu        BELL
hat     10     TN      Karur      MAS
mat     89     TN      Hosur      BELL
paper   77     TN      Nyl        BELL
stone   98     TN      Nyl        BELL

Upvotes: 0

Views: 42

Answers (1)

N.Moudgil
N.Moudgil

Reputation: 869

You can try the following steps:

  1. Converting the CITY column to lowercase

  2. Sorting the data according to COMPANY (considering that you need to keep BELL one)

  3. removing duplicates based on other columns excluding COMPANY

    data.CITY =data.CITY.str.lower()
    data.sort_values(by='COMPANY',inplace=True)
    data.drop_duplicates(subset=['NAME','CODE','STATE','CITY'],keep="first")
    

Hope it helps.

Upvotes: 2

Related Questions