abhishah901
abhishah901

Reputation: 568

Change duplicate columns' names

I am trying to change the column name of one set of duplicate columns in the dataframe. I form the dataframe from two different csv files after a bunch of pre-processing.

merged_df.columns is this:

Index(['PortfolioID', 'FirmID', 'PC_City', 'PC_State', 'PC_Country',
   'Investment_Date', 'Exit_Date', 'Transaction_Comments', 'PC_City', 
   'PC_State', 'PC_Country',
   'fundSize'],
  dtype='object')

The first set of PC_City, PC_State, PC_Country is for portfolio companies and the next set is for the firms. I tried this but made both the sets same. Then I saw this but it can be done only when importing the csv. This doesn't talk about duplicates.

Is there any way I can achieve this? Mods please help with the editing.

Upvotes: 3

Views: 456

Answers (2)

Andy L.
Andy L.

Reputation: 25269

Assume you want to add _Firms to duplicated values of column

Sample:

df.columns

Out[372]:
Index(['PortfolioID', 'FirmID', 'PC_City', 'PC_State', 'PC_Country',
       'Investment_Date', 'Exit_Date', 'Transaction_Comments', 'PC_City',
       'PC_State', 'PC_Country', 'fundSize'],
      dtype='object')

You may use where with duplicated as follows:

cols = df.columns
df.columns = cols.where(~cols.duplicated(), cols + '_Firms')

print(df.columns)

Out[380]:
Index(['PortfolioID', 'FirmID', 'PC_City', 'PC_State', 'PC_Country',
       'Investment_Date', 'Exit_Date', 'Transaction_Comments', 'PC_City_Firms',
       'PC_State_Firms', 'PC_Country_Firms', 'fundSize'],
      dtype='object')

Upvotes: 1

Quang Hoang
Quang Hoang

Reputation: 150825

You can try adding the firm on duplicates:

duplicates = merged_df.columns.duplicated()

merged_df.columns = [x+'_firm' if i else x for x, i in zip(merged_df.columns, duplicates)]

Output:

Index(['PortfolioID', 'FirmID', 'PC_City', 'PC_State', 'PC_Country',
       'Investment_Date', 'Exit_Date', 'Transaction_Comments', 'PC_City_firm',
       'PC_State_firm', 'PC_Country_firm', 'fundSize'],
      dtype='object')

However, from you naming, you should consider passing suffixes=('portfolio', 'firm') into merge function that creates merged_df.

Upvotes: 2

Related Questions