krissy_fong
krissy_fong

Reputation: 15

How do I use df.add_suffix to add suffixes to duplicate column names in Pandas?

I have a large dataframe with 400 columns. 200 of the column names are duplicates of the first 200. How can I used df.add_suffix to add a suffix only to the duplicate column names?

Or is there a better way to do it automatically?

Upvotes: 0

Views: 5070

Answers (5)

Mondher Rouini
Mondher Rouini

Reputation: 1

It will be helpful:

col =  df.columns.values.tolist()
col = [ str(x).replace(x,f'{x}_{cpt}')   for x , cpt in zip(col , range(0,len(col))) ]
df.columns = col

Upvotes: 0

Văn Dũng Lê
Văn Dũng Lê

Reputation: 121

Here is my solution, starting with:

df=pd.DataFrame(np.arange(4).reshape(1,-1),columns=['a','b','a','b'])

output

  a b a b
0 1 2 3 4

Then I use Lambda function

df.columns = df.columns+np.vectorize(lambda x:'_' if x else '')(df.columns.duplicated())

Output

    a   b   a_  b_
0   0   1   2   3

If you have more than one duplicate then you can loop until there is none left. This works for duplicated indices too, it also keeps the index name.

Upvotes: 1

AzwanZ
AzwanZ

Reputation: 1

Add numbering suffix starts with '_1' started with the first duplicated column and applicable to columns appearing more than once.

E.g a column name list: [a, b, c, a, b, a] will return [a, b, c, a_1, b_1, a_2]

from collections import Counter

counter = Counter()
empty_list= []

for x in range(df.shape[1]):
  counter.update([df.columns[x]])
  if counter[df.columns[x]] == 1:
    empty_list.append(df.columns[x])
  else: 
    tx = counter[df.columns[x]] -1
    empty_list.append(df.columns[x] + '_' + str(tx))


df.columns = empty_list
df.columns

Upvotes: 0

mosc9575
mosc9575

Reputation: 6347

If I understand your question correct you have each name twice. If so it is possible to ask for duplicated values using df.columns.duplicated(). Then you can create a new list only modifying duplicated values and adding your self definied suffix. This is different from the other posted solution which modifies all entries.

df = pd.DataFrame(data=[[1, 2, 3, 4]], columns=list('aabb'))
my_suffix = 'T'

df.columns = [name if duplicated == False else name + my_suffix for duplicated, name in zip(df.columns.duplicated(), df.columns)]
df
>>>
   a  aT  b  bT
0  1   2  3   4

My answer has the disadvantage that the dataframe can have duplicated column names if one name is used three or more times.

Upvotes: 0

Dani Mesejo
Dani Mesejo

Reputation: 61920

You could do:

import pandas as pd

# setup dummy DataFrame with repeated columns
df = pd.DataFrame(data=[[1, 2, 3]], columns=list('aaa'))

# create unique identifier for each repeated column
identifier = df.columns.to_series().groupby(level=0).transform('cumcount')

# rename columns with the new identifiers
df.columns = df.columns.astype('string') + identifier.astype('string')

print(df)

Output

   a0  a1  a2
0   1   2   3

If there is only one duplicate column, you could do:

# setup dummy DataFrame with repeated columns
df = pd.DataFrame(data=[[1, 2, 3, 4]], columns=list('aabb'))

# create unique identifier for each repeated column
identifier = df.columns.duplicated().astype(int)

# rename columns with the new identifiers
df.columns = df.columns.astype('string') + identifier.astype(str)

print(df)

Output (for only one duplicate)

   a0  a1  b0  b1
0   1   2   3   4

Upvotes: 0

Related Questions