Reputation: 139
I want to extract the leading category from each string in the title column and store it in a new column named hazard_extract, as in the example below.
import pandas as pd
test = {'title': ['Other', 'Microbiological - Listeria', 'Extraneous Material', 'Chemical', 'Chemical - Histamine', 'Labelling, Other'], 'hazard_extract':['Other', 'Microbiological', 'Extraneous Material', 'Chemical', 'Chemical', 'Labelling']}
example = pd.DataFrame(test)
example
                        title       hazard_extract
0                       Other                Other
1  Microbiological - Listeria      Microbiological
2         Extraneous Material  Extraneous Material
3                    Chemical             Chemical
4        Chemical - Histamine             Chemical
5            Labelling, Other            Labelling
However, with the code below, nothing is extracted when the string contains neither "-" nor ",". How can I extract both words in a case like Extraneous Material, and the single word in a case like Chemical or Other?
example['hazard_extract'] = example['title'].str.extract(r'^(.*?),? ')
                        title   hazard_extract
0                       Other              NaN
1  Microbiological - Listeria  Microbiological
2         Extraneous Material       Extraneous
3                    Chemical              NaN
4        Chemical - Histamine         Chemical
5            Labelling, Other        Labelling
Thank you so much for all the help!
Upvotes: 1
Views: 146
Reputation: 177471
No need for a complicated regular expression:
import pandas as pd
test = {'title': ['Other', 'Microbiological - Listeria', 'Extraneous Material', 'Chemical', 'Chemical - Histamine', 'Labelling, Other']}
example = pd.DataFrame(test)
print(example)
print()
example['hazard_extract'] = example['title'].str.split(' -|,').str[0]
print(example)
                        title
0                       Other
1  Microbiological - Listeria
2         Extraneous Material
3                    Chemical
4        Chemical - Histamine
5            Labelling, Other

                        title       hazard_extract
0                       Other                Other
1  Microbiological - Listeria      Microbiological
2         Extraneous Material  Extraneous Material
3                    Chemical             Chemical
4        Chemical - Histamine             Chemical
5            Labelling, Other            Labelling
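As for why the pattern in the question returns NaN: it ends in a mandatory space, so a title with no space at all (Other, Chemical) cannot match, and Extraneous Material matches lazily only up to its first space. A minimal sketch with the standard re module, assuming str.extract applies the pattern the way re.search does; first_group is a helper name made up for this illustration:

```python
import re

titles = ['Other', 'Microbiological - Listeria', 'Extraneous Material',
          'Chemical', 'Chemical - Histamine', 'Labelling, Other']

# Pattern from the question: the trailing space is mandatory.
question_pattern = re.compile(r'^(.*?),? ')


def first_group(pattern, text):
    """Return capture group 1, or None when the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None


results = [first_group(question_pattern, t) for t in titles]
for title, result in zip(titles, results):
    print(f'{title!r:32} -> {result!r}')
```

The None entries correspond to the NaN rows in the question's output.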
Upvotes: 1
Reputation: 9047
The easiest way is to split on - or , and strip the leftover whitespace:
example['title'].str.split(r'[-,]').str[0].str.strip()
0                  Other
1        Microbiological
2    Extraneous Material
3               Chemical
4               Chemical
5              Labelling
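One caveat with splitting on every hyphen: a title containing a hyphenated word would be cut inside that word. The sample data has no such title, so the first entry below is a hypothetical one. A rough sketch using the standard re module, comparing a split on any hyphen with a split only on a hyphen that has whitespace on both sides:

```python
import re

# Hypothetical titles: the first one contains a hyphenated word.
titles = ['Non-compliant Label - Other', 'Chemical - Histamine', 'Labelling, Other']

every_hyphen = re.compile(r'[-,]')
# Split on a comma, or on a hyphen that has whitespace on both sides.
spaced_hyphen = re.compile(r'\s+-\s+|,')

naive = [every_hyphen.split(t, maxsplit=1)[0].strip() for t in titles]
careful = [spaced_hyphen.split(t, maxsplit=1)[0].strip() for t in titles]

print(naive)    # first entry is cut at the hyphen inside "Non-compliant"
print(careful)  # first entry keeps the hyphenated word intact
```

With pandas, the second pattern could be passed to str.split in place of r'[-,]'.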
Upvotes: 1
Reputation: 1624
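Try this (expand=False returns a Series, and str.strip removes the trailing space that the group captures before a -):
example['title'].str.extract(r'^(\w*\s*\w*)\s*[\,\-]?.*', expand=False).str.strip()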
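Two things to keep in mind with this pattern: the capture group keeps the space that precedes a hyphen, and it stops after two words. The sketch below checks both with the standard re module and compares against an alternative pattern that captures everything before the first - or ,. The alternative pattern and the three-word title are assumptions made for this illustration, not part of the answer:

```python
import re

titles = ['Other', 'Microbiological - Listeria', 'Extraneous Material',
          'Chemical', 'Chemical - Histamine', 'Labelling, Other']

# Pattern from the answer above: at most two words are captured.
two_word = re.compile(r'^(\w*\s*\w*)\s*[\,\-]?.*')
# Hypothetical alternative: everything before the first '-' or ',', any length.
up_to_sep = re.compile(r'^([^,-]+?)\s*(?:[-,]|$)')

two_word_raw = [two_word.search(t).group(1) for t in titles]
alternative = [up_to_sep.search(t).group(1) for t in titles]

print(two_word_raw)  # some entries end in a space until they are stripped
print(alternative)

# A made-up three-word title shows the two-word limit.
long_title = 'Foreign Body Contamination'
print(two_word.search(long_title).group(1))
print(up_to_sep.search(long_title).group(1))
```

In pandas, the alternative pattern could be used as example['title'].str.extract(pattern, expand=False), which returns a Series when the pattern has a single group.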
Upvotes: 0