How to keep a single string and delete other, similar ones in a DataFrame

Question

I combined the category name with the skill name so that I could sort by category name. Now I have a table with a column like the one below:

(Category1) Skill 1
(Category1) Skill 2
(Category1) Skill 3
(Category1) Skill 4
(Category1) Skill 5
(Category1) Skill 6
(Category2) Skill 7
(Category2) Skill 8
(Category2) Skill 9
(Category2) Skill 10
(Category2) Skill 11
(Category2) Skill 12

I want to keep the category header only on the first skill of each category and delete the other, repeated headers, so that the table looks like this:

(Category1) Skill 1
Skill 2
Skill 3
Skill 4
Skill 5
Skill 6
(Category2) Skill 7
Skill 8
Skill 9
Skill 10
Skill 11
Skill 12

Any ideas? Thanks

yatu · Accepted Answer

You could split the strings and keep the last part (Skill x), check where the category part (Categoryx) is duplicated, and use that result to replace the duplicated rows with the split-off skill name:

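import numpy as np

# Split each value on ") ": column 0 holds the category, column 1 the skill name
m = df.col1.str.split(r'\) ', expand=True)
# Keep the full string where a category first appears, otherwise only the skill name
df['col1'] = np.where(m.duplicated(subset=0), m[1], df.col1)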

                   col1
0   (Category1) Skill 1
1               Skill 2
2               Skill 3
3               Skill 4
4               Skill 5
5               Skill 6
6   (Category2) Skill 7
7               Skill 8
8               Skill 9
9              Skill 10
10             Skill 11
11             Skill 12

Input data:

                    col1
0    (Category1) Skill 1
1    (Category1) Skill 2
2    (Category1) Skill 3
3    (Category1) Skill 4
4    (Category1) Skill 5
5    (Category1) Skill 6
6    (Category2) Skill 7
7    (Category2) Skill 8
8    (Category2) Skill 9
9   (Category2) Skill 10
10  (Category2) Skill 11
11  (Category2) Skill 12
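
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. The sample frame is built by hand to mirror the input data above, and the names parts and repeated are only illustrative; parts[0].duplicated() is used in place of m.duplicated(subset=0), which should flag the same rows.

```python
import numpy as np
import pandas as pd

# Hand-built sample mirroring the input data shown above (an assumption
# about how the real frame looks; only the column name col1 is from the answer).
df = pd.DataFrame({
    "col1": [f"(Category1) Skill {i}" for i in range(1, 7)]
          + [f"(Category2) Skill {i}" for i in range(7, 13)]
})

# Split once on ") ": column 0 is "(CategoryN", column 1 is the skill name.
parts = df["col1"].str.split(r"\) ", n=1, expand=True)

# True for every row whose category has already appeared earlier in the column.
repeated = parts[0].duplicated()

# Keep the full string on the first occurrence, otherwise only the skill name.
df["col1"] = np.where(repeated, parts[1], df["col1"])

print(df)
```

Note that duplicated() marks every later occurrence of a category anywhere in the column, not only consecutive ones. That fits here because the rows are already sorted by category; if the same category could show up again further down and should get its header back, a comparison against the previous row would be needed instead.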
