Manish

Reputation: 699

Partially split string column in pandas

I have the following data frame in Python:

import pandas as pd

df = pd.DataFrame({'name': ['Vinay', 'Kushal', 'Aman', 'Saif'],
                   'age': [22, 25, 24, 28],
                   'occupation': ['A1|A2|A3', 'B1|B2|B3', 'C1|C2|C3', 'D1|D2|D3']})

Please note that the values in the "occupation" column are separated by a '|'.

I want to add two new columns to the dataframe, say new1 and new2, holding the values A1 and A2, B1 and B2, and so on.

I tried to achieve this with the following code:

df['new1'] = df['occupation'].str.split("|", n=2, expand=False)

The result I got is:

    name    age occupation  new1
0   Vinay   22  A1|A2|A3    [A1, A2, A3]
1   Kushal  25  B1|B2|B3    [B1, B2, B3]
2   Aman    24  C1|C2|C3    [C1, C2, C3]
3   Saif    28  D1|D2|D3    [D1, D2, D3]

I do not want A1, A2, A3, etc. all together in one new field. Expected output:

    name    age occupation  new1 new2
0   Vinay   22  A1|A2|A3    [A1] [A2]
1   Kushal  25  B1|B2|B3    [B1] [B2]
2   Aman    24  C1|C2|C3    [C1] [C2]
3   Saif    28  D1|D2|D3    [D1] [D2]

Please suggest a possible solution.

Upvotes: 3

Views: 634

Answers (2)

jeschwar

Reputation: 1314

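Here is an option which uses a regular expression with named capture groups; the group names become the new column names. For more details, see the docstring, e.g. by running pd.Series.str.extract? in IPython or Jupyter, or help(pd.Series.str.extract) in a plain Python interpreter.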

# get the new columns in a separate dataframe
# (raw string, so that \w and \| are passed to the regex engine unchanged)
df_ = df['occupation'].str.extract(r'^(?P<new1>\w{2})\|(?P<new2>\w{2})')

# add brackets around each item in the new dataframe
# (in pandas 2.1+, applymap is deprecated in favor of DataFrame.map)
df_ = df_.applymap(lambda x: '[{}]'.format(x))

# add the new dataframe to your original to get the desired result
df = df.join(df_)
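
Note that the pattern above assumes each token is exactly two word characters long. If the tokens can vary in length, a looser pattern such as [^|]+ may be a safer choice. Here is a minimal, self-contained sketch, assuming the same data as in the question (the names parts and result are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Vinay', 'Kushal', 'Aman', 'Saif'],
    'age': [22, 25, 24, 28],
    'occupation': ['A1|A2|A3', 'B1|B2|B3', 'C1|C2|C3', 'D1|D2|D3'],
})

# [^|]+ matches one or more characters that are not a pipe,
# so the first two tokens are captured whatever their length
parts = df['occupation'].str.extract(r'^(?P<new1>[^|]+)\|(?P<new2>[^|]+)')

# the named groups become the column names of the extracted frame
result = df.join(parts)
```

This leaves the values as plain strings; the bracket-adding step above can still be applied to parts if the bracketed form is wanted.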

Upvotes: 0

cs95

Reputation: 402293

For performance, use str.split with a list comprehension:

u = pd.DataFrame(
    [x.split('|')[:2] for x in df.occupation],
    columns=['new1', 'new2'],
    index=df.index,
)
u

  new1 new2
0   A1   A2
1   B1   B2
2   C1   C2
3   D1   D2

pd.concat([df, u], axis=1)

     name  age occupation new1 new2
0   Vinay   22   A1|A2|A3   A1   A2
1  Kushal   25   B1|B2|B3   B1   B2
2    Aman   24   C1|C2|C3   C1   C2
3    Saif   28   D1|D2|D3   D1   D2

Why is a list comprehension fast here? You can read more in "For loops with pandas - When should I care?".
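
If you would rather stay within the pandas string API, str.split with expand=True is another option, though it may well be slower than the list comprehension on large frames. A minimal sketch, assuming the same data as in the question (pieces and result are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Vinay', 'Kushal', 'Aman', 'Saif'],
    'age': [22, 25, 24, 28],
    'occupation': ['A1|A2|A3', 'B1|B2|B3', 'C1|C2|C3', 'D1|D2|D3'],
})

# expand=True returns one column per piece; keep only the first two
pieces = df['occupation'].str.split('|', expand=True).iloc[:, :2]
pieces.columns = ['new1', 'new2']

# the index is preserved by str.split, so the frames align row by row
result = pd.concat([df, pieces], axis=1)
```

The original attempt returned whole lists because expand=False keeps each split result as a single list per row, and n=2 still allows up to three pieces.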

Upvotes: 1
