Reputation: 4745
Given the following:
from io import StringIO
csv = """\
col1,col2,col3
a,x,0.556281869
b,y,0.24783546399999998
c,z,0.010661738999999998
d,r,0.7982442009999999
a,p,0.642690298
f,q,0.734872757
c,t,0.432764343
a,g,0.37042963
"""
df_from = pd.read_csv(StringIO(csv))
csv = """\
col1,col2,col3
a,x,0.42584888
b,y,0.938256412
c,z,0.339138513
d,r,0.16887348600000002
a_2,p,0.18562532199999998
f,q,0.289833661
c,t,0.108988277
a_3,g,0.861110437
"""
df_to = pd.read_csv(StringIO(csv))
Which looks as
# df_from
col1 col2 col3
0 a x 0.556282
1 b y 0.247835
2 c z 0.010662
3 d r 0.798244
4 a p 0.642690
5 f q 0.734873
6 c t 0.432764
7 a g 0.370430
# df_to
col1 col2 col3
0 a x 0.425849
1 b y 0.938256
2 c z 0.339139
3 d r 0.168873
4 a_2 p 0.185625
5 f q 0.289834
6 c t 0.108988
7 a_3 g 0.861110
I would like to be able to create df_to
based on df_from
, I'm aware of duplicated()
, but I don't know how to go about applying this in a way which will add the suffix as needed, because all though these a
values are duplicated they're distinct things.
Upvotes: 1
Views: 379
Reputation: 863311
Use GroupBy.cumcount
for counter, add 1
because by default is counts from 0
, convert to string, prepend _
and last remove _1
values (first values):
s = df_from.groupby('col1').cumcount().add(1).astype(str)
df_from['col1'] += ('_' + s).replace('_1', '')
print (df_from)
col1 col2 col3
0 a x 0.556282
1 b y 0.247835
2 c z 0.010662
3 d r 0.798244
4 a_2 p 0.642690
5 f q 0.734873
6 c_2 t 0.432764
7 a_3 g 0.370430
Or if want only add new values for duplicated values created by Series.duplicated
use similar solution:
mask = df_from['col1'].duplicated()
s1 = '_' + df_from[mask].groupby('col1').cumcount().add(2).astype(str)
df_from.loc[mask, 'col1'] += s1
print (df_from)
col1 col2 col3
0 a x 0.556282
1 b y 0.247835
2 c z 0.010662
3 d r 0.798244
4 a_2 p 0.642690
5 f q 0.734873
6 c_2 t 0.432764
7 a_3 g 0.370430
Upvotes: 2
Reputation: 323366
We can do it by two steps , 1st groupby
with cumcount
, then mask
s='_'+df_from.groupby('col1').cumcount().add(1).astype(str)
df_from.col1+=s.mask(s=='_1','')
df_from
Out[290]:
col1 col2 col3
0 a x 0.556282
1 b y 0.247835
2 c z 0.010662
3 d r 0.798244
4 a_2 p 0.642690
5 f q 0.734873
6 c_2 t 0.432764
7 a_3 g 0.370430
Upvotes: 3