Special remove duplicates from dataframe

Question

I am trying to remove duplicate rows but with some special condition. Input data is as below: I want to remove duplicate rows. I have to compare all the columns and hence I have to sort the dataframe based on CODE and hence I can compare the rows. If cell values are same (Have to ignore the upper and lower case while comparing the column CITY) and if the COMPANY column has both BELL and MAS, then I have to choose the with BELL. Thanks.

NAME    CODE   STATE   CITY       COMPANY
abc     109    TN      Trichy     BELL
abc     109    TN      Salem      MAS
pen     9      TN      Mdu        BELL
pen     9      TN      MDU        MAS
hat     10     TN      Karur      MAS
mat     89     TN      Hosur      MAS
mat     89     TN      Hosur      BELL
paper   77     TN      Nyl        BELL
stone   98     TN      Nyl        BELL
stone   98     TN      Nyl        MAS

Results expected:

NAME    CODE   STATE   CITY       COMPANY
abc     109    TN      Trichy     BELL
abc     109    TN      Salem      MAS
pen     9      TN      Mdu        BELL
hat     10     TN      Karur      MAS
mat     89     TN      Hosur      BELL
paper   77     TN      Nyl        BELL
stone   98     TN      Nyl        BELL

N.Moudgil · Accepted Answer

You can try the following steps:

Converting the CITY column to lowercase
Sorting the data according to COMPANY (considering that you need to keep BELL one)

removing duplicates based on other columns excluding COMPANY

data.CITY =data.CITY.str.lower()
data.sort_values(by='COMPANY',inplace=True)
data.drop_duplicates(subset=['NAME','CODE','STATE','CITY'],keep="first")

Hope it helps.

Special remove duplicates from dataframe

Answers (1)

Related Questions