Miquel
Miquel

Reputation: 678

Rename duplicated index values pandas DataFrame

I have a DataFrame that contains some duplicated index values:

df1 =  pd.DataFrame( np.random.randn(6,6), columns = pd.date_range('1/1/2010', periods=6), index = {"A", "B", "C", "D", "E", "F"})
df1.rename(index = {"C": "A", "B": "E"}, inplace = 1)

ipdb> df1
      2010-01-01  2010-01-02  2010-01-03  2010-01-04  2010-01-05  2010-01-06
 A   -1.163883    0.593760    2.323342   -0.928527    0.058336   -0.209101
 A   -0.593566   -0.894161   -0.789849    1.452725    0.821477   -0.738937
 E   -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
 F    1.707686    0.323213    0.048503    1.168898    0.002662   -1.988825
 D    0.403028   -0.879873   -1.809991   -1.817214   -0.012758    0.283450
 E   -0.224405   -1.803301    0.582946    0.338941    0.798908    0.714560

I would like to change only the name of the duplicated values and to obtain a DataFrame like the following one:

ipdb> df1
     2010-01-01  2010-01-02  2010-01-03  2010-01-04  2010-01-05  2010-01-06
A   -1.163883    0.593760    2.323342   -0.928527    0.058336   -0.209101
A_dp   -0.593566   -0.894161   -0.789849    1.452725    0.821477   -0.738937
E   -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
F    1.707686    0.323213    0.048503    1.168898    0.002662   -1.988825
D    0.403028   -0.879873   -1.809991   -1.817214   -0.012758    0.283450
E_dp   -0.224405   -1.803301    0.582946    0.338941    0.798908    0.714560

My approach:

(i) Create dictionary with new names

old_names = df1[df1.index.duplicated()].index.values
new_names = df1[df1.index.duplicated()].index.values + "_dp"
dictionary = dict(zip(old_names, new_names))

(ii) Rename only the duplicated values

df1.loc[df1.index.duplicated(),:].rename(index = dictionary, inplace = True)

However this does not seem to work.

Upvotes: 12

Views: 8163

Answers (4)

ThreeOrangeOneRed
ThreeOrangeOneRed

Reputation: 68

Here's a function to rename any duplicate row a column of your dataframe based on its order of appearance. You can just use df = df.reset_index() before using the function and set column='index' to use this for your index.

def rename_duplicates(df, column='column_name'):
    df = df.copy()
    duplicate_rows = df[column].duplicated(keep=False)
    duplicate_indices = df[duplicate_rows].index

    duplicates_df = df.loc[duplicate_indices, :]
    duplicates_df.loc[:, column] = duplicates_df[column] + '_' + duplicates_df.reset_index().index.astype('str')
    df.loc[duplicate_indices, column] = duplicates_df[column]
    return df

If you have two entries named 'apple' these will now be 'apple_0' and 'apple_1' etc.

Upvotes: 0

Muhammad Yasirroni
Muhammad Yasirroni

Reputation: 2167

Further improvements from @kotrfa answer, we can the function more flaxible using prefix and suffix. This approach also behaves more like when windows file duplicated upon downloads, adding extra (n). This approach also support inplace operation from *args and **kwargs.

def rename_duplicates(df, prefix=' (', suffix=')', *args, **kwargs):
    appendents = (prefix
                  + (df.groupby(level=0).cumcount() + 1).astype(str)
                  + suffix
                 ).replace(f'{prefix}1{suffix}', '')
    return df.set_index(df.index + appendents, *args, **kwargs)

Upvotes: 0

kotrfa
kotrfa

Reputation: 1331

I used jezrael's great answer in this renaming function:

def rn(df, suffix = '-duplicate-'):
    appendents = (suffix + df.groupby(level=0).cumcount().astype(str).replace('0','')).replace(suffix, '')
    return df.set_index(df.index + appendents)

then this:

df = pd.DataFrame({'a':[1,2,3,4,5,6,7,8, 9]}, index=['a'+str(i) for i in [1,2,3,3,4,3,5,5, 6]])
rn(df)

spits out this:

    a
a1  1
a2  2
a3  3
a3-duplicate-1  4
a4  5
a3-duplicate-2  6
a5  7
a5-duplicate-1  8
a6  9

Upvotes: 4

jezrael
jezrael

Reputation: 863751

You can use Index.where:

df1.index = df1.index.where(~df1.index.duplicated(), df1.index + '_dp')
print (df1)
      2010-01-01  2010-01-02  2010-01-03  2010-01-04  2010-01-05  2010-01-06
A      -1.163883    0.593760    2.323342   -0.928527    0.058336   -0.209101
A_dp   -0.593566   -0.894161   -0.789849    1.452725    0.821477   -0.738937
E      -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
F       1.707686    0.323213    0.048503    1.168898    0.002662   -1.988825
D       0.403028   -0.879873   -1.809991   -1.817214   -0.012758    0.283450
E_dp   -0.224405   -1.803301    0.582946    0.338941    0.798908    0.714560

And if need remove of duplicated index to unique:

print (df1)
   2010-01-01  2010-01-02  2010-01-03  2010-01-04  2010-01-05  2010-01-06
A   -1.163883    0.593760    2.323342   -0.928527    0.058336   -0.209101
A   -0.593566   -0.894161   -0.789849    1.452725    0.821477   -0.738937
E   -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
E   -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
E   -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
F    1.707686    0.323213    0.048503    1.168898    0.002662   -1.988825
D    0.403028   -0.879873   -1.809991   -1.817214   -0.012758    0.283450
E   -0.224405   -1.803301    0.582946    0.338941    0.798908    0.714560

df1.index = df1.index + df1.groupby(level=0).cumcount().astype(str).replace('0','')
print (df1)
    2010-01-01  2010-01-02  2010-01-03  2010-01-04  2010-01-05  2010-01-06
A    -1.163883    0.593760    2.323342   -0.928527    0.058336   -0.209101
A1   -0.593566   -0.894161   -0.789849    1.452725    0.821477   -0.738937
E    -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
E1   -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
E2   -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
F     1.707686    0.323213    0.048503    1.168898    0.002662   -1.988825
D     0.403028   -0.879873   -1.809991   -1.817214   -0.012758    0.283450
E3   -0.224405   -1.803301    0.582946    0.338941    0.798908    0.714560

Upvotes: 27

Related Questions