
Reputation: 959

Pandas DataFrame - renaming multiple identically named columns

I have several columns with the same name in a df. I need to rename them, but the problem is that the df.rename method renames them all the same way. How can I rename the blah(s) below to blah1, blah4, blah5?

df = pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns = ['blah','blah2','blah3','blah','blah']

#     blah  blah2  blah3  blah  blah
# 0   0     1      2      3     4
# 1   5     6      7      8     9

Here is what happens when using the df.rename method:

df.rename(columns={'blah': 'blah1'})

#     blah1  blah2  blah3  blah1  blah1
# 0   0      1      2      3      4
# 1   5      6      7      8      9

Upvotes: 36

Views: 65755

Answers (14)

MaxU - stand with Ukraine

Reputation: 210982

We can use the internal (undocumented) method:

In [38]: pd.io.parsers.base_parser.ParserBase({'names':df.columns, 'usecols':None})._maybe_dedup_names(df.columns)
Out[38]: ['blah', 'blah2', 'blah3', 'blah.1', 'blah.2']

This is the "magic" function:

   def _maybe_dedup_names(self, names: Sequence[Hashable]) -> Sequence[Hashable]:
        # see gh-7160 and gh-9424: this helps to provide
        # immediate alleviation of the duplicate names
        # issue and appears to be satisfactory to users,
        # but ultimately, not needing to butcher the names
        # would be nice!
        if self.mangle_dupe_cols:
            names = list(names)  # so we can index
            counts: DefaultDict[Hashable, int] = defaultdict(int)
            is_potential_mi = _is_potential_multi_index(names, self.index_col)

            for i, col in enumerate(names):
                cur_count = counts[col]

                while cur_count > 0:
                    counts[col] = cur_count + 1

                    if is_potential_mi:
                        # for mypy
                        assert isinstance(col, tuple)
                        col = col[:-1] + (f"{col[-1]}.{cur_count}",)
                    else:
                        col = f"{col}.{cur_count}"
                    cur_count = counts[col]

                names[i] = col
                counts[col] = cur_count + 1

        return names

Upvotes: 40


Reputation: 61

In Pandas v2.1 you can use the pd.io.common.dedup_names function, like:

In [137]: pd.io.common.dedup_names(df.columns, is_potential_multiindex=False)
Out[137]: ['blah', 'blah2', 'blah3', 'blah.1', 'blah.2']

The earlier method (pd.io.parsers.base_parser.ParserBase({'names':df.columns, 'usecols':None})._maybe_dedup_names(df.columns)) has been removed, so it no longer works.
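
Since both calls above rely on pandas internals that have moved between releases, a version-independent fallback may be useful. The following is only a minimal sketch, not the pandas implementation: the helper name dedup_names_sketch is hypothetical, and it assumes flat (non-MultiIndex) string labels. It aims to reproduce the "name.N" suffixes shown in the output above.

```python
from collections import defaultdict


def dedup_names_sketch(names):
    """Append '.N' to repeated names (hypothetical helper, not a pandas API)."""
    counts = defaultdict(int)  # last suffix tried for each original name
    seen = set()               # every name already emitted
    result = []
    for name in names:
        new_name = name
        # Keep bumping the suffix until the candidate has not been used yet.
        while new_name in seen:
            counts[name] += 1
            new_name = f"{name}.{counts[name]}"
        seen.add(new_name)
        result.append(new_name)
    return result


print(dedup_names_sketch(['blah', 'blah2', 'blah3', 'blah', 'blah']))
# ['blah', 'blah2', 'blah3', 'blah.1', 'blah.2']
```

The result can be assigned back with df.columns = dedup_names_sketch(df.columns).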

Upvotes: 2

Dance Party

Reputation: 3723

Here's an elegant solution:

Isolate a dataframe with only the repeated columns (selecting a single label usually returns a Series, but it returns a DataFrame when more than one column has that name):

df1 = df['blah']

For each "blah" column, give it a unique number

df1.columns = ['blah_' + str(int(x)) for x in range(len(df1.columns))]

Isolate a dataframe with all but the repeated columns:

df2 = df[[x for x in df.columns if x != 'blah']]

Merge back together on indices:

df3 = pd.merge(df1, df2, left_index=True, right_index=True)

Et voila:

   blah_0  blah_1  blah_2  blah2  blah3
0       0       3       4      1      2
1       5       8       9      6      7

Upvotes: 0


Reputation: 21

This is my solution:

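cols = []  # for tracking whether we have already seen a name
new_cols = []

for col in df.columns:
    cols.append(col)
    count = cols.count(col)
    if count > 1:
        new_cols.append(f'{col}_{count}')  # assumed suffix format
    else:
        new_cols.append(col)

df.columns = new_cols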

Upvotes: 0


Reputation: 627

I created a function with some tests, so it should be drop-in ready. It differs a little from Lamakaha's excellent solution in that it also renames the first appearance of a duplicate column:

from collections import defaultdict
from typing import Dict, List, Set

import pandas as pd

def rename_duplicate_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Rename column headers to ensure no header names are duplicated.

    Args:
        df (pd.DataFrame): A dataframe with a single index of columns

    Returns:
        pd.DataFrame: The dataframe with headers renamed; inplace
    """
    if not df.columns.has_duplicates:
        return df
    duplicates: Set[str] = set(df.columns[df.columns.duplicated()].tolist())
    indexes: Dict[str, int] = defaultdict(lambda: 0)
    new_cols: List[str] = []
    for col in df.columns:
        if col in duplicates:
            indexes[col] += 1
            new_cols.append(f"{col}.{indexes[col]}")
        else:
            new_cols.append(col)
    df.columns = new_cols
    return df

def test_rename_duplicate_columns():
    df = pd.DataFrame(data=[[1, 2]], columns=["a", "b"])
    assert rename_duplicate_columns(df).columns.tolist() == ["a", "b"]

    df = pd.DataFrame(data=[[1, 2]], columns=["a", "a"])
    assert rename_duplicate_columns(df).columns.tolist() == ["a.1", "a.2"]

    df = pd.DataFrame(data=[[1, 2, 3]], columns=["a", "b", "a"])
    assert rename_duplicate_columns(df).columns.tolist() == ["a.1", "b", "a.2"]

Upvotes: 1


Reputation: 41

I just wrote this code; it uses a list comprehension to update all duplicated names.

df.columns = [x[1] if x[1] not in df.columns[:x[0]] else f"{x[1]}_{list(df.columns[:x[0]]).count(x[1])}" for x in enumerate(df.columns)]

Upvotes: 4

Krishn Kumar

Reputation: 1

We can just assign each column a different name.

Suppose the duplicate column names are like ['a', 'b', 'c', 'd', 'd', 'c'].

Then just create a list of the names you want to assign:

c = ['a', 'b', 'c', 'd', 'D1', 'C1']
df.columns = c

This works for me.

Upvotes: 0


Reputation: 5428

Here's a solution that also works for multi-indexes

# Take a df and rename duplicate columns by appending number suffixes
def rename_duplicates(df):
    import copy
    new_columns = df.columns.values
    suffix = {key: 2 for key in set(new_columns)}
    dup = pd.Series(new_columns).duplicated()

    if type(df.columns) == pd.core.indexes.multi.MultiIndex:
        # Need to be mutable, make it list instead of tuples
        for i in range(len(new_columns)):
            new_columns[i] = list(new_columns[i])
        for ix, item in enumerate(new_columns):
            item_orig = copy.copy(item)
            if dup[ix]:
                for level in range(len(new_columns[ix])):
                    new_columns[ix][level] = new_columns[ix][level] + f"_{suffix[tuple(item_orig)]}"
                suffix[tuple(item_orig)] += 1

        for i in range(len(new_columns)):
            new_columns[i] = tuple(new_columns[i])

        df.columns = pd.MultiIndex.from_tuples(new_columns)
    else:
        # Not a MultiIndex
        for ix, item in enumerate(new_columns):
            if dup[ix]:
                new_columns[ix] = item + f"_{suffix[item]}"
                suffix[item] += 1
        df.columns = new_columns

Upvotes: 1


Reputation: 959

I was looking for a solution within Pandas rather than a general Python solution. The columns' get_loc() function returns a boolean mask array when it finds duplicates, with True values marking the locations of the duplicated label. I then use the mask to assign new values to those locations. In my case, I know ahead of time how many dups I'm going to get and what I'm going to assign to them, but it looks like df.columns.get_duplicates() would return a list of all dups, which you could then use together with get_loc() if you need a more generic dup-weeding action:


cols = pd.Series(df.columns)
for dup in df.columns[df.columns.duplicated(keep=False)]: 
    cols[df.columns.get_loc(dup)] = ([dup + '.' + str(d_idx) 
                                     if d_idx != 0 
                                     else dup 
                                     for d_idx in range(df.columns.get_loc(dup).sum())]
                                    )
df.columns = cols

    blah    blah2   blah3   blah.1  blah.2
 0     0        1       2        3       4
 1     5        6       7        8       9

New Better Method (Update 03Dec2019)

The code below is better than the code above. It is copied from another answer below (@SatishSK):

#sample df with duplicate blah column
df = pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns = ['blah','blah2','blah3','blah','blah']

# you just need the following 4 lines to rename duplicates
# df is the dataframe that you want to rename duplicated columns
cols = pd.Series(df.columns)

for dup in cols[cols.duplicated()].unique(): 
    cols[cols[cols == dup].index.values.tolist()] = [dup + '.' + str(i) if i != 0 else dup for i in range(sum(cols == dup))]

# rename the columns with the cols list.
df.columns = cols

    blah    blah2   blah3   blah.1  blah.2
0   0   1   2   3   4
1   5   6   7   8   9
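
The same ".N" suffixing can also be sketched with groupby and cumcount, which stays within pandas and avoids the explicit loop. This is an illustrative sketch rather than part of the method above; it assumes string column labels and does not check whether a generated name such as 'blah.1' already exists among the columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(2 * 5).reshape(2, 5))
df.columns = ['blah', 'blah2', 'blah3', 'blah', 'blah']

cols = pd.Series(df.columns)
# cumcount() numbers the occurrences of each label: 0 for the first, 1 for the second, ...
occurrence = cols.groupby(cols).cumcount()
# Keep first occurrences unchanged; give later ones a '.N' suffix.
new_cols = cols.where(occurrence == 0, cols + '.' + occurrence.astype(str))
df.columns = new_cols.tolist()

print(df.columns.tolist())
# ['blah', 'blah2', 'blah3', 'blah.1', 'blah.2']
```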

Upvotes: 38

T. Jewell

Reputation: 61

duplicated_idx = dataset.columns.duplicated()

duplicated = dataset.columns[duplicated_idx].unique()

rename_cols = []

i = 1
for col in dataset.columns:
    if col in duplicated:
        rename_cols.extend([col + '_' + str(i)])
        i = i + 1
    else:
        rename_cols.extend([col])

dataset.columns = rename_cols

Upvotes: 3


Reputation: 56

Thank you @Lamakaha for the solution. Your idea gave me a chance to modify it and make it workable in all the cases.

I am using Python 3.7.3.

I tried your code on my data set, which had only one duplicated column, i.e. two columns with the same name. Unfortunately, the column names remained as-is without being renamed. On top of that, I got a warning that get_duplicates() is deprecated and will be removed in a future version. Using duplicated() coupled with unique() in place of get_duplicates() did not yield the expected result either.

I have modified your code a little, and it now works for my data set as well as in other general cases.

Here are the code runs with and without code modification on the example data set mentioned in the question along with results:




cols = pd.Series(df.columns)
for dup in df.columns.get_duplicates(): 
    cols[df.columns.get_loc(dup)]=[dup+'.'+str(d_idx) if d_idx!=0 else dup for d_idx in range(df.columns.get_loc(dup).sum())]
df.columns = cols


f:\Anaconda3\lib\site-packages\ FutureWarning: 'get_duplicates' is deprecated and will be removed in a future release. You can use idx[idx.duplicated()].unique() instead


    blah    blah2   blah3   blah    blah.1
0   0   1   2   3   4
1   5   6   7   8   9

Two of the three "blah"(s) are not renamed properly.

Modified code



cols = pd.Series(df.columns)
for dup in cols[cols.duplicated()].unique(): 
    cols[cols[cols == dup].index.values.tolist()] = [dup + '.' + str(i) if i != 0 else dup for i in range(sum(cols == dup))]
df.columns = cols



    blah    blah2   blah3   blah.1  blah.2
0   0   1   2   3   4
1   5   6   7   8   9

Here is a run of the modified code on another example:

cols = pd.Series(['X', 'Y', 'Z', 'A', 'B', 'C', 'A', 'A', 'L', 'M', 'A', 'Y', 'M'])

for dup in cols[cols.duplicated()].unique():
    cols[cols[cols == dup].index.values.tolist()] = [dup + '_' + str(i) if i != 0 else dup for i in range(sum(cols == dup))]


0       X
1       Y
2       Z
3       A
4       B
5       C
6     A_1
7     A_2
8       L
9       M
10    A_3
11    Y_1
12    M_1
dtype: object

I hope this helps anyone who is looking for an answer to this question.

Upvotes: 2


Reputation: 9842

Since the accepted answer (by Lamakaha) is not working for recent versions of pandas, and because the other suggestions looked a bit clumsy, I worked out my own solution:

def dedupIndex(idx, fmt=None, ignoreFirst=True):
    # fmt:          A string format that receives two arguments: 
    #               name and a counter. By default: fmt='%s.%03d'
    # ignoreFirst:  Disable/enable postfixing of first element.
    idx = pd.Series(idx)
    duplicates = idx[idx.duplicated()].unique()
    fmt = '%s.%03d' if fmt is None else fmt
    for name in duplicates:
        dups = idx==name
        ret = [ fmt%(name,i) if (i!=0 or not ignoreFirst) else name
                      for i in range(dups.sum()) ]
        idx.loc[dups] = ret
    return pd.Index(idx)

Use the function as follows:

df.columns = dedupIndex(df.columns)
# Result: ['blah', 'blah2', 'blah3', 'blah.001', 'blah.002']
df.columns = dedupIndex(df.columns, fmt='%s #%d', ignoreFirst=False)
# Result: ['blah #0', 'blah2', 'blah3', 'blah #1', 'blah #2']

Upvotes: 1

Glen Thompson

Reputation: 10016

You could use this:

def df_column_uniquify(df):
    df_columns = df.columns
    new_columns = []
    for item in df_columns:
        counter = 0
        newitem = item
        while newitem in new_columns:
            counter += 1
            newitem = "{}_{}".format(item, counter)
        new_columns.append(newitem)
    df.columns = new_columns
    return df


import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns = ['blah','blah2','blah3','blah','blah']


so that df:

   blah  blah2  blah3   blah   blah
0     0      1      2      3      4
1     5      6      7      8      9


df = df_column_uniquify(df)

so that df:

   blah  blah2  blah3  blah_1  blah_2
0     0      1      2       3       4
1     5      6      7       8       9

Upvotes: 15


Reputation: 394419

You could assign directly to the columns:

In [12]:

df.columns = ['blah','blah2','blah3','blah4','blah5']
   blah  blah2  blah3  blah4  blah5
0     0      1      2      3      4
1     5      6      7      8      9

[2 rows x 5 columns]
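
If only some columns should change, as in the question (blah to blah1, blah4, blah5), the direct assignment can be driven by column position instead of by label. This is a minimal sketch; the dictionary name new_names_by_position is hypothetical, and the positions are the ones from the question's example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(2 * 5).reshape(2, 5))
df.columns = ['blah', 'blah2', 'blah3', 'blah', 'blah']

# Hypothetical mapping from column position to its new name.
new_names_by_position = {0: 'blah1', 3: 'blah4', 4: 'blah5'}

# Positions are unique even when labels are not, so each column can be targeted individually.
df.columns = [new_names_by_position.get(i, name) for i, name in enumerate(df.columns)]

print(df.columns.tolist())
# ['blah1', 'blah2', 'blah3', 'blah4', 'blah5']
```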

If you want to dynamically just rename the duplicate columns then you could do something like the following (code taken from answer 2: Index of duplicates items in a python list):

In [25]:

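import collections
dups = collections.defaultdict(list)
dup_indices = []
col_list = list(df.columns)
for i, e in enumerate(list(df.columns)):
  dups[e].append(i)  # record every position at which each name occurs
for k, v in sorted(dups.items()):
  if len(v) >= 2:
    dup_indices = v  # note: only the last duplicated name's positions are kept

for i in dup_indices:
    col_list[i] = col_list[i] + ' ' + str(i)
col_list
Out[25]:
['blah 0', 'blah2', 'blah3', 'blah 3', 'blah 4']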

You could then use this to assign back. You could also have a function that generates a unique name not already present in the columns prior to renaming.
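
A function that generates a name not already present in the columns might look like the following. This is only a sketch under stated assumptions: the helper names unique_name and rename_repeats are hypothetical, the "_N" suffix format is an arbitrary choice, and labels are assumed to be strings.

```python
def unique_name(base, taken):
    """Return the first of base_1, base_2, ... that is not in `taken` (hypothetical helper)."""
    n = 1
    candidate = f"{base}_{n}"
    while candidate in taken:
        n += 1
        candidate = f"{base}_{n}"
    return candidate


def rename_repeats(columns):
    """Rename second and later occurrences, avoiding every name present before renaming."""
    taken = set(columns)  # names that existed prior to renaming
    seen = set()          # names already emitted
    result = []
    for col in columns:
        if col in seen:
            col = unique_name(col, taken)
            taken.add(col)  # a generated name is now taken as well
        seen.add(col)
        result.append(col)
    return result


print(rename_repeats(['blah', 'blah_1', 'blah3', 'blah', 'blah']))
# ['blah', 'blah_1', 'blah3', 'blah_2', 'blah_3']
```

The result can be assigned back with df.columns = rename_repeats(list(df.columns)).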

Upvotes: 4
