Reputation: 107
I have the following dataframe:
line# key amino0 pos0 amino1 pos1 amino2 pos2
0 14 A 13 M 2 K 14
1 12 A 13 M 2 A 1
2 1 A 1 M 2 P 3
3 2 P 3 P 4 B 6
4 1 A 1 M 2 P 35
5 12 A 31 A 32 M 41
6 1 M 24 P 23 A 22
7 12 A 31 A 32 M 42
8 4 J 5 P 4 B 6
9 3 B 6 I 7 P 4
10 8 B 6 H 10 I 7
I want to update the 'key' column so that each value carries its occurrence number, which distinguishes the rows of keys that have frequency > 1. My output should look like this:
line# key amino0 pos0 amino1 pos1 amino2 pos2
0 14_1 A 13 M 2 K 14
1 12_1 A 13 M 2 A 1
2 1_1 A 1 M 2 P 3
3 2_1 P 3 P 4 B 6
4 1_2 A 1 M 2 P 35
5 12_2 A 31 A 32 M 41
6 1_3 M 24 P 23 A 22
7 12_3 A 31 A 32 M 42
8 4_1 J 5 P 4 B 6
9 3_1 B 6 I 7 P 4
10 8_1 B 6 H 10 I 7
For each element in the 'key' column, the first part is the key and the second part is the occurrence number. For example, key 12 has frequency 3, so the three rows containing key 12 are updated to 12_1, 12_2, 12_3.
The following code only gives the keys with freq > 1.
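import pandas as pd
df = pd.read_csv("myfile.txt", sep='\t', names = ['key', 'amino0', 'pos0','amino1', 'pos1','amino2', 'pos2'])
# frequency of each key, most frequent first
vc = df.key.value_counts()
# as written, this prints only the first key among those occurring more than twice
print(vc[vc > 2].index[0])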
How can I update the keys? Avoiding a loop is preferable.
Upvotes: 3
Views: 50
Reputation: 863266
If the type of the key column is string, use radd:
df['key'] += df.groupby('key').cumcount().add(1).astype(str).radd('_')
#alternative
#df['key'] += '_' + df.groupby('key').cumcount().add(1).astype(str)
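To make the radd step concrete, here is a minimal sketch on a small, made-up Series of string keys (the sample values are an assumption for illustration, not the data from the question). It shows that `radd('_')` gives the same result as writing `'_' + ...`:

```python
import pandas as pd

# Hypothetical sample of string keys, for illustration only.
keys = pd.Series(["12", "1", "12", "1", "12"])

# 1-based occurrence number of each key, converted to strings.
counter = keys.groupby(keys).cumcount().add(1).astype(str)

# radd puts the underscore on the left side: "_" + counter.
suffix = counter.radd("_")
same_as_plus = suffix.equals("_" + counter)

labelled = keys + suffix
print(labelled.tolist())
```

Both spellings should behave the same way; radd is just convenient at the end of a method chain.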
If it is integer, it is necessary to convert it to string first:
df['key'] = df['key'].astype(str) + '_' + df.groupby('key').cumcount().add(1).astype(str)
print (df)
line# key amino0 pos0 amino1 pos1 amino2 pos2
0 0 14_1 A 13 M 2 K 14
1 1 12_1 A 13 M 2 A 1
2 2 1_1 A 1 M 2 P 3
3 3 2_1 P 3 P 4 B 6
4 4 1_2 A 1 M 2 P 35
5 5 12_2 A 31 A 32 M 41
6 6 1_3 M 24 P 23 A 22
7 7 12_3 A 31 A 32 M 42
8 8 4_1 J 5 P 4 B 6
9 9 3_1 B 6 I 7 P 4
10 10 8_1 B 6 H 10 I 7
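For anyone who wants to try this without the original file, here is a small self-contained sketch. It assumes the key column holds integers and rebuilds only the key and amino0 columns from the question's sample, so the frame itself is a reduced stand-in rather than the real data:

```python
import pandas as pd

# Reduced stand-in for the question's data: only 'key' and 'amino0'.
df = pd.DataFrame({
    "key": [14, 12, 1, 2, 1, 12, 1, 12, 4, 3, 8],
    "amino0": list("AAAPAAMAJBB"),
})

# Number each occurrence within its key group (0-based), shift to 1-based,
# then join it to the key with an underscore.
occurrence = df.groupby("key").cumcount().add(1).astype(str)
df["key"] = df["key"].astype(str) + "_" + occurrence

print(df["key"].tolist())
```

The counter is computed before the key column is overwritten, so the grouping still runs on the original values.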
Details:
First use GroupBy.cumcount to get a counter within each group defined by the column key:
print (df.groupby('key').cumcount())
0 0
1 0
2 0
3 0
4 1
5 1
6 2
7 2
8 0
9 0
10 0
dtype: int64
Then add 1 so that the counter starts from 1; .add(1) works like + 1:
print (df.groupby('key').cumcount().add(1))
0 1
1 1
2 1
3 1
4 2
5 2
6 3
7 3
8 1
9 1
10 1
dtype: int64
To convert to strings use astype; the object dtype in the output below means the values are strings:
print (df.groupby('key').cumcount().add(1).astype(str))
0 1
1 1
2 1
3 1
4 2
5 2
6 3
7 3
8 1
9 1
10 1
dtype: object
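If the intent is to suffix only the keys with frequency > 1 and leave unique keys unchanged (one possible reading of the question, although the expected output suffixes every key), a hedged variant could combine the same counter with `Series.duplicated(keep=False)`. The frame below is a stand-in holding only the key column from the question:

```python
import pandas as pd

# Stand-in frame with only the 'key' column from the question (integers assumed).
df = pd.DataFrame({"key": [14, 12, 1, 2, 1, 12, 1, 12, 4, 3, 8]})

# Same 1-based counter as above, as strings.
counter = df.groupby("key").cumcount().add(1).astype(str)

# True for every row whose key occurs more than once.
repeated = df["key"].duplicated(keep=False)

# Keep unique keys as they are; append the counter only to repeated keys.
key_str = df["key"].astype(str)
df["key"] = key_str.where(~repeated, key_str + "_" + counter)

print(df["key"].tolist())
```

This is only a sketch of that alternative reading, not a replacement for the solution above.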
Upvotes: 2