Reputation: 678
I have a DataFrame that contains some duplicated index values:
df1 = pd.DataFrame( np.random.randn(6,6), columns = pd.date_range('1/1/2010', periods=6), index = {"A", "B", "C", "D", "E", "F"})
df1.rename(index = {"C": "A", "B": "E"}, inplace = 1)
ipdb> df1
2010-01-01 2010-01-02 2010-01-03 2010-01-04 2010-01-05 2010-01-06
A -1.163883 0.593760 2.323342 -0.928527 0.058336 -0.209101
A -0.593566 -0.894161 -0.789849 1.452725 0.821477 -0.738937
E -0.670305 -1.788403 0.134790 -0.270894 0.672948 1.149089
F 1.707686 0.323213 0.048503 1.168898 0.002662 -1.988825
D 0.403028 -0.879873 -1.809991 -1.817214 -0.012758 0.283450
E -0.224405 -1.803301 0.582946 0.338941 0.798908 0.714560
I would like to change only the name of the duplicated values and to obtain a DataFrame like the following one:
ipdb> df1
2010-01-01 2010-01-02 2010-01-03 2010-01-04 2010-01-05 2010-01-06
A -1.163883 0.593760 2.323342 -0.928527 0.058336 -0.209101
A_dp -0.593566 -0.894161 -0.789849 1.452725 0.821477 -0.738937
E -0.670305 -1.788403 0.134790 -0.270894 0.672948 1.149089
F 1.707686 0.323213 0.048503 1.168898 0.002662 -1.988825
D 0.403028 -0.879873 -1.809991 -1.817214 -0.012758 0.283450
E_dp -0.224405 -1.803301 0.582946 0.338941 0.798908 0.714560
My approach:
(i) Create dictionary with new names
old_names = df1[df1.index.duplicated()].index.values
new_names = df1[df1.index.duplicated()].index.values + "_dp"
dictionary = dict(zip(old_names, new_names))
(ii) Rename only the duplicated values
df1.loc[df1.index.duplicated(),:].rename(index = dictionary, inplace = True)
However this does not seem to work.
Upvotes: 12
Views: 8163
Reputation: 68
Here's a function to rename any duplicate row a column of your dataframe based on its order of appearance. You can just use df = df.reset_index()
before using the function and set column='index'
to use this for your index.
def rename_duplicates(df, column='column_name'):
df = df.copy()
duplicate_rows = df[column].duplicated(keep=False)
duplicate_indices = df[duplicate_rows].index
duplicates_df = df.loc[duplicate_indices, :]
duplicates_df.loc[:, column] = duplicates_df[column] + '_' + duplicates_df.reset_index().index.astype('str')
df.loc[duplicate_indices, column] = duplicates_df[column]
return df
If you have two entries named 'apple' these will now be 'apple_0' and 'apple_1' etc.
Upvotes: 0
Reputation: 2167
Further improvements from @kotrfa answer, we can the function more flaxible using prefix and suffix. This approach also behaves more like when windows file duplicated upon downloads, adding extra (n)
. This approach also support inplace operation from *args
and **kwargs
.
def rename_duplicates(df, prefix=' (', suffix=')', *args, **kwargs):
appendents = (prefix
+ (df.groupby(level=0).cumcount() + 1).astype(str)
+ suffix
).replace(f'{prefix}1{suffix}', '')
return df.set_index(df.index + appendents, *args, **kwargs)
Upvotes: 0
Reputation: 1331
I used jezrael's great answer in this renaming function:
def rn(df, suffix = '-duplicate-'):
appendents = (suffix + df.groupby(level=0).cumcount().astype(str).replace('0','')).replace(suffix, '')
return df.set_index(df.index + appendents)
then this:
df = pd.DataFrame({'a':[1,2,3,4,5,6,7,8, 9]}, index=['a'+str(i) for i in [1,2,3,3,4,3,5,5, 6]])
rn(df)
spits out this:
a
a1 1
a2 2
a3 3
a3-duplicate-1 4
a4 5
a3-duplicate-2 6
a5 7
a5-duplicate-1 8
a6 9
Upvotes: 4
Reputation: 863751
You can use Index.where
:
df1.index = df1.index.where(~df1.index.duplicated(), df1.index + '_dp')
print (df1)
2010-01-01 2010-01-02 2010-01-03 2010-01-04 2010-01-05 2010-01-06
A -1.163883 0.593760 2.323342 -0.928527 0.058336 -0.209101
A_dp -0.593566 -0.894161 -0.789849 1.452725 0.821477 -0.738937
E -0.670305 -1.788403 0.134790 -0.270894 0.672948 1.149089
F 1.707686 0.323213 0.048503 1.168898 0.002662 -1.988825
D 0.403028 -0.879873 -1.809991 -1.817214 -0.012758 0.283450
E_dp -0.224405 -1.803301 0.582946 0.338941 0.798908 0.714560
And if need remove of duplicated index to unique:
print (df1)
2010-01-01 2010-01-02 2010-01-03 2010-01-04 2010-01-05 2010-01-06
A -1.163883 0.593760 2.323342 -0.928527 0.058336 -0.209101
A -0.593566 -0.894161 -0.789849 1.452725 0.821477 -0.738937
E -0.670305 -1.788403 0.134790 -0.270894 0.672948 1.149089
E -0.670305 -1.788403 0.134790 -0.270894 0.672948 1.149089
E -0.670305 -1.788403 0.134790 -0.270894 0.672948 1.149089
F 1.707686 0.323213 0.048503 1.168898 0.002662 -1.988825
D 0.403028 -0.879873 -1.809991 -1.817214 -0.012758 0.283450
E -0.224405 -1.803301 0.582946 0.338941 0.798908 0.714560
df1.index = df1.index + df1.groupby(level=0).cumcount().astype(str).replace('0','')
print (df1)
2010-01-01 2010-01-02 2010-01-03 2010-01-04 2010-01-05 2010-01-06
A -1.163883 0.593760 2.323342 -0.928527 0.058336 -0.209101
A1 -0.593566 -0.894161 -0.789849 1.452725 0.821477 -0.738937
E -0.670305 -1.788403 0.134790 -0.270894 0.672948 1.149089
E1 -0.670305 -1.788403 0.134790 -0.270894 0.672948 1.149089
E2 -0.670305 -1.788403 0.134790 -0.270894 0.672948 1.149089
F 1.707686 0.323213 0.048503 1.168898 0.002662 -1.988825
D 0.403028 -0.879873 -1.809991 -1.817214 -0.012758 0.283450
E3 -0.224405 -1.803301 0.582946 0.338941 0.798908 0.714560
Upvotes: 27