pandas dataframe create unique ids from column having elements frequency greater than 1

Question

I have the following dataframe:

line# key amino0  pos0 amino1  pos1 amino2  pos2
0    14      A    13      M     2      K    14
1    12      A    13      M     2      A     1
2     1      A     1      M     2      P     3
3     2      P     3      P     4      B     6
4     1      A     1      M     2      P    35
5    12      A    31      A    32      M    41
6     1      M    24      P    23      A    22
7    12      A    31      A    32      M    42
8     4      J     5      P     4      B     6
9     3      B     6      I     7      P     4
10    8      B     6      H    10      I     7

I want to update the 'key' column by numbering each occurrence of the keys, in particular those with frequency > 1. My output should look like this:

line# key       amino0  pos0 amino1  pos1 amino2  pos2
0    14_1      A    13      M     2      K    14
1    12_1      A    13      M     2      A     1
2     1_1      A     1      M     2      P     3
3     2_1      P     3      P     4      B     6
4     1_2      A     1      M     2      P    35
5    12_2      A    31      A    32      M    41
6     1_3      M    24      P    23      A    22
7    12_3      A    31      A    32      M    42
8     4_1      J     5      P     4      B     6
9     3_1      B     6      I     7      P     4
10    8_1      B     6      H    10      I     7

For each element in the 'key' column, the first part is the key and the second part is the occurrence number. For example, key 12 has frequency 3, so its three occurrences in three rows will be updated to 12_1, 12_2, 12_3.

The following code only gives me the keys with frequency > 1:

import pandas as pd

df = pd.read_csv("myfile.txt", sep='\t', names = ['key', 'amino0', 'pos0','amino1', 'pos1','amino2', 'pos2'])
vc = df.key.value_counts()
print(vc[vc > 1].index[0])  # most frequent key among those occurring more than once

How can I update the keys? Avoiding a loop is preferable.

jezrael · Accepted Answer

If the key column already holds strings, use radd:

df['key'] += df.groupby('key').cumcount().add(1).astype(str).radd('_')
#alternative
#df['key'] +=  '_' + df.groupby('key').cumcount().add(1).astype(str)
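As a minimal sketch of what radd does here, the snippet below uses a small hypothetical frame (the three keys are made up for illustration, not taken from myfile.txt) whose keys are already strings. It suggests that Series.radd('_') and '_' + Series should give the same suffixes.

```python
import pandas as pd

# Hypothetical small frame whose keys are already strings.
df = pd.DataFrame({"key": ["12", "1", "12"]})

# Per-key counter starting at 1, converted to strings.
counter = df.groupby("key").cumcount().add(1).astype(str)

# radd is the reflected add: counter.radd("_") computes "_" + counter.
suffix_radd = counter.radd("_")
suffix_plus = "_" + counter

result = df["key"] + suffix_radd
print(result.tolist())
```

Either spelling can be used; radd simply keeps the expression as one method chain.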

If the column holds integers, convert it to strings first:

df['key'] = df['key'].astype(str) + '_' + df.groupby('key').cumcount().add(1).astype(str)
print (df)
    line#   key amino0  pos0 amino1  pos1 amino2  pos2
0       0  14_1      A    13      M     2      K    14
1       1  12_1      A    13      M     2      A     1
2       2   1_1      A     1      M     2      P     3
3       3   2_1      P     3      P     4      B     6
4       4   1_2      A     1      M     2      P    35
5       5  12_2      A    31      A    32      M    41
6       6   1_3      M    24      P    23      A    22
7       7  12_3      A    31      A    32      M    42
8       8   4_1      J     5      P     4      B     6
9       9   3_1      B     6      I     7      P     4
10     10   8_1      B     6      H    10      I     7
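To try the integer case without the input file, here is a self-contained sketch that rebuilds three of the question's columns in memory. Constructing the frame by hand rather than with read_csv is an assumption made only to keep the example runnable; the remaining columns are omitted because they do not affect the result.

```python
import pandas as pd

# key, amino0 and pos0 copied from the question's table; other columns omitted.
df = pd.DataFrame({
    "key":    [14, 12, 1, 2, 1, 12, 1, 12, 4, 3, 8],
    "amino0": ["A", "A", "A", "P", "A", "A", "M", "A", "J", "B", "B"],
    "pos0":   [13, 13, 1, 3, 1, 31, 24, 31, 5, 6, 6],
})

# The right-hand side is evaluated first, so groupby still sees the integer keys.
df["key"] = (
    df["key"].astype(str)
    + "_"
    + df.groupby("key").cumcount().add(1).astype(str)
)
print(df)
```

The suffix follows the order in which rows appear, so sorting the frame beforehand would change which row receives which number.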

Details:

First use GroupBy.cumcount to get a counter within each group defined by the key column:

print (df.groupby('key').cumcount())
0     0
1     0
2     0
3     0
4     1
5     1
6     2
7     2
8     0
9     0
10    0
dtype: int64

Then add 1 so the counter starts at 1 (.add(1) is equivalent to + 1):

print (df.groupby('key').cumcount().add(1))
0     1
1     1
2     1
3     1
4     2
5     2
6     3
7     3
8     1
9     1
10    1
dtype: int64

To convert to strings use astype(str); the object dtype in the output below means the values are now Python strings:

print (df.groupby('key').cumcount().add(1).astype(str))
0     1
1     1
2     1
3     1
4     2
5     2
6     3
7     3
8     1
9     1
10    1
dtype: object
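The question's title can also be read as asking to suffix only the keys that occur more than once, although the expected output shown there suffixes every key. Under that alternative reading, one possible sketch (not part of the original answer) is to mask the repeated keys with Series.duplicated(keep=False) and apply the suffix only to those rows. The in-memory frame below is an assumption standing in for the file.

```python
import pandas as pd

# Keys copied from the question's table.
df = pd.DataFrame({"key": [14, 12, 1, 2, 1, 12, 1, 12, 4, 3, 8]})

# Per-key counter starting at 1, as in the steps above.
counter = df.groupby("key").cumcount().add(1).astype(str)

# True for every row whose key appears more than once in the column.
repeated = df["key"].duplicated(keep=False)

key_str = df["key"].astype(str)
# where() keeps the plain key for non-repeated rows and uses the suffixed key otherwise.
df["key"] = key_str.where(~repeated, key_str + "_" + counter)
print(df["key"].tolist())
```

Keys that occur once stay as plain strings, so the column remains unique while only the repeated keys change.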
