Reputation: 699
I have the following DataFrame in Python:
import pandas as pd

df = pd.DataFrame({'name': ['Vinay', 'Kushal', 'Aman', 'Saif'],
                   'age': [22, 25, 24, 28],
                   'occupation': ['A1|A2|A3', 'B1|B2|B3', 'C1|C2|C3', 'D1|D2|D3']})
Note that the values in the "occupation" field are separated by '|'.
I want to add two new columns to the DataFrame, say new1 and new2, holding the first two values of each row: A1 and A2, B1 and B2, and so on.
I tried to achieve this using the following code:
df['new1'] = df['occupation'].str.split("|", n = 2,expand = False)
The result I got is:
     name  age occupation          new1
0   Vinay   22   A1|A2|A3  [A1, A2, A3]
1  Kushal   25   B1|B2|B3  [B1, B2, B3]
2    Aman   24   C1|C2|C3  [C1, C2, C3]
3    Saif   28   D1|D2|D3  [D1, D2, D3]
I do not want the full list (A1, A2, A3, etc.) in the new fields. Expected output:
     name  age occupation  new1  new2
0   Vinay   22   A1|A2|A3  [A1]  [A2]
1  Kushal   25   B1|B2|B3  [B1]  [B2]
2    Aman   24   C1|C2|C3  [C1]  [C2]
3    Saif   28   D1|D2|D3  [D1]  [D2]
Please suggest a possible solution.
Upvotes: 3
Views: 634
Reputation: 1314
Here is an option that uses a regular expression with named capture groups. For more details, you can read the docstring by running pd.Series.str.extract? in an IPython session.
# get the new columns in a separate dataframe
df_ = df['occupation'].str.extract(r'^(?P<new1>\w{2})\|(?P<new2>\w{2})')
# add brackets around each item in the new dataframe
# (applymap is deprecated since pandas 2.1; use DataFrame.map there instead)
df_ = df_.applymap(lambda x: '[{}]'.format(x))
# add the new dataframe to your original to get the desired result
df = df.join(df_)
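The pattern above assumes every token is exactly two word characters. As a minimal sketch of the same named-group idea without that assumption, the variant below captures everything up to the next pipe instead. The names `pattern`, `extracted`, and `result` are illustrative only, and the brackets are left out:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Vinay', 'Kushal', 'Aman', 'Saif'],
    'age': [22, 25, 24, 28],
    'occupation': ['A1|A2|A3', 'B1|B2|B3', 'C1|C2|C3', 'D1|D2|D3'],
})

# [^|]+ matches a run of non-pipe characters, so tokens of any length are captured.
# The group names become the column names of the extracted DataFrame.
pattern = r'^(?P<new1>[^|]+)\|(?P<new2>[^|]+)'
extracted = df['occupation'].str.extract(pattern)

# Rows that do not match the pattern (fewer than two tokens) would get NaN.
result = df.join(extracted)
print(result)
```

Because the join is on the index, the original DataFrame keeps its rows and order.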
Upvotes: 0
Reputation: 402293
For performance, use Python's built-in str.split with a list comprehension:
u = pd.DataFrame(
    [x.split('|')[:2] for x in df.occupation],
    columns=['new1', 'new2'],
    index=df.index)
u
  new1 new2
0   A1   A2
1   B1   B2
2   C1   C2
3   D1   D2
pd.concat([df, u], axis=1)
     name  age occupation new1 new2
0   Vinay   22   A1|A2|A3   A1   A2
1  Kushal   25   B1|B2|B3   B1   B2
2    Aman   24   C1|C2|C3   C1   C2
3    Saif   28   D1|D2|D3   D1   D2
Why is a list comprehension fast here? You can read more in "For loops with pandas - When should I care?".
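For comparison, here is a sketch of the pandas-native route, which is close to the attempt in the question: passing expand=True to Series.str.split returns one column per token instead of a single column of lists. It assumes at least one row contains two or more tokens; the names `parts` and `result` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Vinay', 'Kushal', 'Aman', 'Saif'],
    'age': [22, 25, 24, 28],
    'occupation': ['A1|A2|A3', 'B1|B2|B3', 'C1|C2|C3', 'D1|D2|D3'],
})

# expand=True spreads the tokens over integer-labelled columns 0, 1, 2, ...
# Keep only the first two of them.
parts = df['occupation'].str.split('|', expand=True).iloc[:, :2]
parts.columns = ['new1', 'new2']

# Join on the index to attach the new columns to the original frame.
result = df.join(parts)
print(result)
```

This is usually slower than the list comprehension on large frames, but it keeps missing values as missing rather than raising on them.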
Upvotes: 1