Pandas - mark duplicate elements in a list with an increasing suffix

Question

Given the following:

from io import StringIO

csv = """\
col1,col2,col3
a,x,0.556281869
b,y,0.24783546399999998
c,z,0.010661738999999998
d,r,0.7982442009999999
a,p,0.642690298
f,q,0.734872757
c,t,0.432764343
a,g,0.37042963
"""
df_from = pd.read_csv(StringIO(csv))

csv = """\
col1,col2,col3
a,x,0.42584888
b,y,0.938256412
c,z,0.339138513
d,r,0.16887348600000002
a_2,p,0.18562532199999998
f,q,0.289833661
c,t,0.108988277
a_3,g,0.861110437
"""
df_to = pd.read_csv(StringIO(csv))

Which looks as

# df_from
  col1 col2      col3
0    a    x  0.556282
1    b    y  0.247835
2    c    z  0.010662
3    d    r  0.798244
4    a    p  0.642690
5    f    q  0.734873
6    c    t  0.432764
7    a    g  0.370430

# df_to
  col1 col2      col3
0    a    x  0.425849
1    b    y  0.938256
2    c    z  0.339139
3    d    r  0.168873
4  a_2    p  0.185625
5    f    q  0.289834
6    c    t  0.108988
7  a_3    g  0.861110

I would like to be able to create df_to based on df_from, I'm aware of duplicated(), but I don't know how to go about applying this in a way which will add the suffix as needed, because all though these a values are duplicated they're distinct things.

jezrael · Accepted Answer

Use GroupBy.cumcount for counter, add 1 because by default is counts from 0, convert to string, prepend _ and last remove _1 values (first values):

s = df_from.groupby('col1').cumcount().add(1).astype(str)
df_from['col1'] += ('_' + s).replace('_1', '')
print (df_from)
  col1 col2      col3
0    a    x  0.556282
1    b    y  0.247835
2    c    z  0.010662
3    d    r  0.798244
4  a_2    p  0.642690
5    f    q  0.734873
6  c_2    t  0.432764
7  a_3    g  0.370430

Or if want only add new values for duplicated values created by Series.duplicated use similar solution:

mask = df_from['col1'].duplicated()
s1 = '_' + df_from[mask].groupby('col1').cumcount().add(2).astype(str)
df_from.loc[mask, 'col1'] += s1
print (df_from)
  col1 col2      col3
0    a    x  0.556282
1    b    y  0.247835
2    c    z  0.010662
3    d    r  0.798244
4  a_2    p  0.642690
5    f    q  0.734873
6  c_2    t  0.432764
7  a_3    g  0.370430

Pandas - mark duplicate elements in a list with an increasing suffix

Answers (2)

Related Questions