psowa001
psowa001

Reputation: 813

How to extract single string and delete other, similar in DataFrame

I combine category name with skill name to sort it by category name. Now I have table with column as below

(Category1) Skill 1
(Category1) Skill 2
(Category1) Skill 3
(Category1) Skill 4
(Category1) Skill 5
(Category1) Skill 6
(Category2) Skill 7
(Category2) Skill 8
(Category2) Skill 9
(Category2) Skill 10
(Category2) Skill 11
(Category2) Skill 12

I want to leave just one category header per first skill and delete other, similar to have table like this one

(Category1) Skill 1
Skill 2
Skill 3
Skill 4
Skill 5
Skill 6
(Category2) Skill 7
Skill 8
Skill 9
Skill 10
Skill 11
Skill 12

Any ideas? Thanks

Upvotes: 1

Views: 44

Answers (2)

Allen Qin
Allen Qin

Reputation: 19947

Suppose your dataframe(df) column is called 'A':

df2 = df.A.str.split(expand=True)
df2[0]=df2[0].mask(df2[0].eq(df2[0].shift())).fillna('')]
df.A = df2.apply(lambda x: ' '.join(x), axis=1)

Upvotes: 0

yatu
yatu

Reputation: 88236

You could split the strings and retrieve the last part Skill x, as well as check where Categoryx is duplicated, and use the result to replace with the splitted part:

import numpy as np

m = df.col1.str.split(r'\) ', expand=True)
df['col1'] = np.where(m.duplicated(subset=0), m[1], df.col1)

               col1
0   (Category1) Skill 1
1               Skill 2
2               Skill 3
3               Skill 4
4               Skill 5
5               Skill 6
6   (Category2) Skill 7
7               Skill 8
8               Skill 9
9              Skill 10
10             Skill 11
11             Skill 12

Input data -

 col1
0    (Category1) Skill 1
1    (Category1) Skill 2
2    (Category1) Skill 3
3    (Category1) Skill 4
4    (Category1) Skill 5
5    (Category1) Skill 6
6    (Category2) Skill 7
7    (Category2) Skill 8
8    (Category2) Skill 9
9   (Category2) Skill 10
10  (Category2) Skill 11
11  (Category2) Skill 12

Upvotes: 2

Related Questions