Python Pandas remove duplicate Code not working

Question

I am trying to remove duplicate columns from my DataFrame.

My code is below:

import pandas as pd

xls = pd.ExcelFile('Base File.xlsx')

mapping_df = xls.parse('Mapping')
engagement_data_df = xls.parse('Detail Report')
# Keep only the first occurrence of each column name
engagement_data_df = engagement_data_df.loc[:, ~engagement_data_df.columns.duplicated()]

I have two duplicate columns called 'BCS Attached Flag'. I tried to deduplicate them with the code above, but it had no effect. What am I doing wrong?

Adrian

Edit: It seems that a .1 suffix is appended to the name of the duplicate column, although in the csv file both columns are named BCS Attached Flag. This is the output of print(engagement_data_df.head(10)):

Division   Region       BCS Attached Flag  BCS Attached Flag.1
China      China A      Y                  Y
Singapore  Singapore B  Y                  Y

jezrael · Accepted Answer

I think you first need to extract only the text part of each column name, and then call duplicated:

m = ~engagement_data_df.columns.str.extract('([a-zA-Z]+)', expand=False).duplicated()
engagement_data_df = engagement_data_df.loc[:, m]
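
One caveat with the pattern above: '([a-zA-Z]+)' captures only the first run of letters, so 'BCS Attached Flag' and 'BCS Attached Flag.1' both reduce to 'BCS', and any other column whose name starts with 'BCS ' would be dropped as well. A possible variant, as a minimal sketch rather than a definitive fix, is to strip only a trailing '.<digits>' suffix before calling duplicated. The sample data below is made up to mirror the question, and read_csv on an in-memory string stands in for the Excel sheet; the names csv_text, base_names and deduped are hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the 'Detail Report' sheet.
# The header repeats 'BCS Attached Flag', so pandas renames the second one on read.
csv_text = (
    "Division,Region,BCS Attached Flag,BCS Attached Flag\n"
    "China,China A,Y,Y\n"
    "Singapore,Singapore B,Y,Y\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# After reading, the names differ ('...Flag' vs '...Flag.1'),
# so a plain duplicated() on the columns flags nothing.
print(df.columns.duplicated().any())

# Strip a trailing '.<digits>' suffix, then look for repeats among the base names.
base_names = df.columns.str.replace(r"\.\d+$", "", regex=True)
deduped = df.loc[:, ~base_names.duplicated()]
print(list(deduped.columns))
```

Note that this assumes no column legitimately ends in '.<digits>' while sharing its base name with another column; such a column would be treated as a duplicate too. It also compares names only, not values, so it is worth checking that the two columns really hold the same data before dropping one.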
