Partially split string column in pandas

Question

I have the following DataFrame in Python:

import pandas as pd

df = pd.DataFrame({'name': ['Vinay', 'Kushal', 'Aman', 'Saif'],
                   'age': [22, 25, 24, 28],
                   'occupation': ['A1|A2|A3', 'B1|B2|B3', 'C1|C2|C3', 'D1|D2|D3']})

Note that the values in the "occupation" column are separated by a '|'.

I want to add two new columns to the DataFrame, let's say new1 and new2, holding the first two values of each row: A1 and A2, B1 and B2, etc.

I tried to achieve this using the following code:

df['new1'] = df['occupation'].str.split("|", n=2, expand=False)

The result I got is:

    name    age occupation  new1
0   Vinay   22  A1|A2|A3    [A1, A2, A3]
1   Kushal  25  B1|B2|B3    [B1, B2, B3]
2   Aman    24  C1|C2|C3    [C1, C2, C3]
3   Saif    28  D1|D2|D3    [D1, D2, D3]

I do not want the full lists (A1, A2, A3, etc.) in the new fields. Expected output:

    name    age occupation  new1 new2
0   Vinay   22  A1|A2|A3    [A1] [A2]
1   Kushal  25  B1|B2|B3    [B1] [B2]
2   Aman    24  C1|C2|C3    [C1] [C2]
3   Saif    28  D1|D2|D3    [D1] [D2]

Please suggest a possible solution.

cs95 · Accepted Answer

For performance, use str.split with a list comprehension:

u = pd.DataFrame(
    [x.split('|')[:2] for x in df.occupation],
    columns=['new1', 'new2'], index=df.index)
u

  new1 new2
0   A1   A2
1   B1   B2
2   C1   C2
3   D1   D2

pd.concat([df, u], axis=1)

     name  age occupation new1 new2
0   Vinay   22   A1|A2|A3   A1   A2
1  Kushal   25   B1|B2|B3   B1   B2
2    Aman   24   C1|C2|C3   C1   C2
3    Saif   28   D1|D2|D3   D1   D2
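
For comparison, the same two columns can also be built with pandas' own .str accessor instead of a list comprehension. This is a minimal sketch and not part of the answer above: the variable names parts and out are only illustrative, and it assumes every row contains at least two '|'-separated values (a missing element would come back as NaN).

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Vinay', 'Kushal', 'Aman', 'Saif'],
    'age': [22, 25, 24, 28],
    'occupation': ['A1|A2|A3', 'B1|B2|B3', 'C1|C2|C3', 'D1|D2|D3'],
})

# Split once; each row becomes a list such as ['A1', 'A2', 'A3'].
parts = df['occupation'].str.split('|')

# .str[i] picks the i-th element of each list (NaN if it is missing).
out = df.assign(new1=parts.str[0], new2=parts.str[1])
print(out)
```

This version keeps everything inside pandas, which can be convenient when the column may contain missing values, since the list comprehension would raise on a NaN entry.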

Why is a list comprehension fast here? You can read more in "For loops with pandas - When should I care?".
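
To check the performance claim on your own machine, a rough timing sketch might look like the following. The benchmark data (the question's four values repeated 2500 times) and the two helper function names are assumptions made for illustration. The actual numbers depend on the pandas version, the data size, and the hardware, so none are quoted here.

```python
import timeit

import pandas as pd

# Hypothetical benchmark data: the question's four values repeated many times.
occupation = pd.Series(['A1|A2|A3', 'B1|B2|B3', 'C1|C2|C3', 'D1|D2|D3'] * 2500)


def with_list_comprehension(s):
    # Plain Python string methods applied to each value, as in the answer.
    return pd.DataFrame([x.split('|')[:2] for x in s],
                        columns=['new1', 'new2'], index=s.index)


def with_str_accessor(s):
    # pandas' string accessor, used here only as a baseline for comparison.
    parts = s.str.split('|')
    return pd.DataFrame({'new1': parts.str[0], 'new2': parts.str[1]})


for func in (with_list_comprehension, with_str_accessor):
    # Time five full runs of each approach on the same Series.
    seconds = timeit.timeit(lambda: func(occupation), number=5)
    print(f'{func.__name__}: {seconds:.4f} s for 5 runs')
```

Both helpers return the same values, so any difference in the printed times reflects the overhead of each approach rather than a difference in the work done.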
