Reputation: 139
I have the following pandas Series:
cfia_recalls_merged['title'].head()
0 One Ocean brand Sliced Smoked Wild Sockeye Salmon recalled due to Listeria monocytogenes
1 Pastene brand Green Olives Sliced recalled due to container integrity defects
2 Casa Italia brand Soppressata Piccante Salami recalled due to possible spoilage
3 Obiji brand Palm Oil recalled due to Sudan IV
4 One Degree Organic Foods brand Gluten Free Sprouted Rolled Oats recalled due to packaging integrity defects and rancidity
Name: title, dtype: object
I want to extract certain parts of each string and put them in new columns. Example:
test = {'brand': ['One Ocean', 'Pastene', 'Casa Italia'], 'product': ['Sliced Smoked Wild Sockeye Salmon', 'Green Olives Sliced', 'Soppressata Piccante Salami'], 'hazard': ['Listeria monocytogenes', 'container integrity defects', 'possible spoilage']}
example = pd.DataFrame(test)
example
         brand                            product                       hazard
0    One Ocean  Sliced Smoked Wild Sockeye Salmon       Listeria monocytogenes
1      Pastene                Green Olives Sliced  container integrity defects
2  Casa Italia        Soppressata Piccante Salami            possible spoilage
Essentially, my separators are "brand" and "recalled due to".
How can I do this with regex and capture groups?
Any help is appreciated. Thank you in advance!
Upvotes: 1
Views: 40
Reputation: 520898
You could use str.extract here, with one capture group per column:
cfia_recalls_merged['brand'] = cfia_recalls_merged['title'].str.extract(r'^(.*?) brand\b')
cfia_recalls_merged['product'] = cfia_recalls_merged['title'].str.extract(r'^.*? brand (.*?) recalled due to\b')
cfia_recalls_merged['hazard'] = cfia_recalls_merged['title'].str.extract(r'\brecalled due to (.*)$')
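Since the question asks about capture groups, the three calls could also be folded into a single pattern with named groups. This is only a sketch: the `recalls` DataFrame and the `pattern` and `parts` names are illustrative, and it assumes every title follows the "<brand> brand <product> recalled due to <hazard>" layout. Rows that do not match come back as NaN in all three columns.

```python
import pandas as pd

# Sample titles copied from the question; `recalls` is an illustrative name.
recalls = pd.DataFrame({
    "title": [
        "One Ocean brand Sliced Smoked Wild Sockeye Salmon recalled due to Listeria monocytogenes",
        "Pastene brand Green Olives Sliced recalled due to container integrity defects",
        "Casa Italia brand Soppressata Piccante Salami recalled due to possible spoilage",
    ]
})

# Each named group becomes a column in the DataFrame returned by str.extract.
# The lazy .*? stops at the first " brand " and the first " recalled due to ".
pattern = r"^(?P<brand>.*?) brand (?P<product>.*?) recalled due to (?P<hazard>.*)$"

parts = recalls["title"].str.extract(pattern)

# Attach the extracted columns to the original frame.
recalls = recalls.join(parts)
print(recalls[["brand", "product", "hazard"]])
```

One pattern keeps the three columns consistent with each other, at the cost of being stricter: a title missing either separator yields NaN for every column, whereas the separate calls above can still fill in the parts that do match.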
Upvotes: 1