Michael Gann
Michael Gann

Reputation: 173

How to split a column into two columns by first and last found pattern in Pandas (Python 3.x)

i have a problem with splitting a column into two columns. I want to split the column by the first and last found pattern '-'. Maybe this is trivial.

Here is my column:

        col1
0       aa-bb-cc-dd
1       aa-bb-cc
2       aa-bb-cc
3       aa-bb-cc-dd

This is the frame i want as result:

        col1           col2
0       bb-cc          dd
1       bb             cc
2       bb             cc
3       bb-cc          dd

Thanks in advance!

Upvotes: 2

Views: 384

Answers (5)

John Zwinck
John Zwinck

Reputation: 249434

Here's an idiomatic but slow way to do it:

df.col1 = df.col1.str.split('-', 1).str[1] # discard first part
parts = df.col1.str.rsplit('-', 1).str
df.col1 = parts[0]
df['col2'] = parts[1]

While this works, it is not fast: about 4 seconds for 700k rows. Looking at it you'd think this is a good way to do it, but performance-wise it's worse than all the alternatives.

Upvotes: 2

specbug
specbug

Reputation: 562

This might help:

df['col2'] = df['col1'].split('-')[-1]
df['col1'] = '-'.join(i for i in df['col1'].split('-')[1:-1])

Upvotes: -1

Space Impact
Space Impact

Reputation: 13255

First slice and use str.rsplit and rename:

df = df.col1.str[3:].str.rsplit('-', n=1, expand=True).rename(columns={0:'col1',1:'col2'})

print(df)
    col1 col2
0  bb-cc   dd
1     bb   cc
2     bb   cc
3  bb-cc   dd

Upvotes: 1

jpp
jpp

Reputation: 164773

You can use a list comprehension:

df = pd.DataFrame([i.split('-', 1)[1].rsplit('-', 1) for i in df['col1']],
                  columns=['col1', 'col2'])

print(df)

    col1 col2
0  bb-cc   dd
1     bb   cc
2     bb   cc
3  bb-cc   dd

Pandas str methods exist primarily for convenience. For clean data, you may find the list comprehension more efficient for larger dataframes.

Upvotes: 3

Matina G
Matina G

Reputation: 1582

If I understand well your question, you need to get rid of the first block delimited by a '-', then split the last '-' block in col2. If that is what you need, you could consider this:

df= pd.DataFrame({'col1':['aa-bb-cc-dd', 'aa-bb-cc', 'aa-bb-cc', 'aa-bb-cc-dd']})
df['col2'] = df['col1'].apply(lambda x: x[x.rfind('-')+1:])
df['col1'] =  df['col1'].apply(lambda x: x[x.find('-')+1:x.rfind('-')])
print (df)

Upvotes: 1

Related Questions