lordy

Reputation: 630

Splitting the strings in a pandas DataFrame column and saving the last part in a new column

I would like to split the strings in a specific DataFrame column on " - " and save the last part in a new column. This works on plain strings, outside a DataFrame:

s0 = '34 years old woman with pain in her XXX - Pharyngitis'
s1 = '67 years old man with xxx - yyy zzz - Nephropathy'
s2 = 'Metastatic Liver Cancer'

print(s0.split(" - ")[-1])  # works
print(s1.split(" - ")[-1])
print(s2.split(" - ")[-1])

But not with a data frame:

import pandas as pd

df = pd.DataFrame([s0, s1, s2], columns=['title'])
df['diagnosis'] = df['title'].str.split(' - ')[-1]  # KeyError: -1
print(df['diagnosis'])

What am I doing wrong?

Upvotes: 2

Views: 401

Answers (3)

RomanPerekhrest

Reputation: 92854

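Instead of splitting each string into a list of chunks, you can find the last hyphen with Python's str.rfind and slice from there, applied row by row. Note that this matches any '-', not only ' - ', so a hyphen inside the diagnosis itself would cut the result short: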

In [104]: df['title'].apply(lambda s: s[s.rfind('-') + 1:].strip())
Out[104]: 
0                Pharyngitis
1                Nephropathy
2    Metastatic Liver Cancer
Name: title, dtype: object
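
The same "take everything after the last separator" idea can also be written without apply. This is only a sketch, assuming the separator is always the literal ' - ' from the question; it uses Series.str.rsplit with n=1 to split once from the right, then .str[-1] to pick the last piece. The variable name diagnosis is just for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [
        '34 years old woman with pain in her XXX - Pharyngitis',
        '67 years old man with xxx - yyy zzz - Nephropathy',
        'Metastatic Liver Cancer',
    ],
    columns=['title'],
)

# Split at most once, starting from the right, on the literal ' - '.
# Each row becomes a list of one or two pieces; .str[-1] takes the last one.
diagnosis = df['title'].str.rsplit(' - ', n=1).str[-1]
print(diagnosis)
```

Because the full ' - ' is matched, a bare hyphen inside a word is left alone, and rows without the separator come back unchanged.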

Upvotes: 3

Pritish kumar

Reputation: 512

Write a function that returns the last part of the split, then apply it to the column.

import pandas as pd

s0 = '34 years old woman with pain in her XXX - Pharyngitis'
s1 = '67 years old man with xxx - yyy zzz - Nephropathy'
s2 = 'Metastatic Liver Cancer'

def f(x):
    return x.split(" - ")[-1]

df = pd.DataFrame([s0, s1, s2], columns=['title'])
df['diagnosis'] = df['title'].apply(f) 
print(df['diagnosis'])
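
For context on the KeyError in the question: as far as I can tell, df['title'].str.split(' - ') returns a Series whose values are lists, so putting [-1] on it looks for a row labelled -1 in the index rather than the last element of each list. A small sketch of the difference, where parts and label_lookup_failed are names made up for this illustration:

```python
import pandas as pd

df = pd.DataFrame(
    [
        '34 years old woman with pain in her XXX - Pharyngitis',
        '67 years old man with xxx - yyy zzz - Nephropathy',
        'Metastatic Liver Cancer',
    ],
    columns=['title'],
)

# A Series whose values are Python lists, one list per row.
parts = df['title'].str.split(' - ')

try:
    # Indexes the Series itself: looks for the row label -1,
    # but the index only holds the labels 0, 1 and 2.
    parts[-1]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# .str[-1] indexes into each list instead of into the Series.
df['diagnosis'] = parts.str[-1]

print(label_lookup_failed)
print(df['diagnosis'].tolist())
```

So the apply-based function above and the .str[-1] form should give the same column; the function is just easier to extend if the extraction logic grows.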

Upvotes: 1

emremrah

Reputation: 1765

You can use apply and lambda here:

s0 = '34 years old woman with pain in her XXX - Pharyngitis'
s1 = '67 years old man with xxx - yyy zzz - Nephropathy'
s2 = 'Metastatic Liver Cancer'

df = pd.DataFrame([s0, s1, s2], columns=['title'])

df['diagnosis'] = df['title'].apply(lambda x: x.split(' - ')[-1]) 

print(df['diagnosis'])

Prints:

0                Pharyngitis
1                Nephropathy
2    Metastatic Liver Cancer
Name: diagnosis, dtype: object

If you would rather get an empty string when there is no ' - ' in the string, change the line to:

df['diagnosis'] = df['title'].apply(lambda x: x.split(' - ')[-1] if ' - ' in x else '')
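
The empty-string variant can probably be done without a lambda as well. A minimal sketch, assuming every title is a string (no missing values); has_sep is a hypothetical helper name, and the approach swaps the conditional expression for Series.str.contains plus Series.where:

```python
import pandas as pd

df = pd.DataFrame(
    [
        '34 years old woman with pain in her XXX - Pharyngitis',
        '67 years old man with xxx - yyy zzz - Nephropathy',
        'Metastatic Liver Cancer',
    ],
    columns=['title'],
)

# Boolean mask: True where the literal ' - ' occurs in the title.
has_sep = df['title'].str.contains(' - ', regex=False)

# Take the last piece of each split, then keep it only where the
# separator was present; every other row gets an empty string.
df['diagnosis'] = df['title'].str.split(' - ').str[-1].where(has_sep, '')

print(df['diagnosis'].tolist())
```

If the column can contain NaN, the mask would need extra handling (for example the na argument of str.contains), which is left out here.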

Upvotes: 1
