Reputation: 258
I would like to extract the name of the item from the text.
fg['Product'] = pd.Series([' 5 Guys Greasy Burger 3/5LB (24) [51656]', '5 Guys Super Strawberry Shake - (3/4) OZ (9) [5645654]', '5 Guys Giant Loaded Double Cheese Burger 1/2LB Buns - 8Z Cups (22) [564654]'])
What I need in the df column for analysis by product
fg['Product'] = 'Greasy Burger', 'Super Strawberry Shake', 'Giant Loaded Double Cheese Burger'
I have tried multiple things, but this got me the closest.
fg['Product'] = fg['Product'].str.strip('5 Guys').str.replace(r'\[d+\]')
But this isn't close to getting me there. The logic in the pattern appears to be strip '5 Guys' and then remove anything after the first numeric digit or the first hyphen '-'. Just can't figure it out.
Upvotes: 0
Views: 1927
Reputation: 3230
r"5 Guys (.*?)(?=[0-9]|-)"
Details:
(.*?)
: Group 1 - any characters, as few as possible
(?=[0-9]|-)
: Condition (a lookahead) that stops the match at the first numeric digit or the first hyphen
Upvotes: 0
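As a minimal sketch of how the pattern r"5 Guys (.*?)(?=[0-9]|-)" could be applied to the sample data from the question: the variable names `products` and `names` are illustrative, and the trailing `.str.strip()` is an addition, since the lazy group also captures the space that precedes the digit or hyphen.

```python
import pandas as pd

products = pd.Series([
    " 5 Guys Greasy Burger 3/5LB (24) [51656]",
    "5 Guys Super Strawberry Shake - (3/4) OZ (9) [5645654]",
    "5 Guys Giant Loaded Double Cheese Burger 1/2LB Buns - 8Z Cups (22) [564654]",
])

# expand=False returns a Series because the pattern has a single capture group.
# .str.strip() removes the space captured just before the digit or hyphen.
names = products.str.extract(r"5 Guys (.*?)(?=[0-9]|-)", expand=False).str.strip()
print(names.tolist())
```

Rows where the pattern does not match come back as NaN, so they are easy to spot afterwards.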
Reputation: 168
You can apply the regex r"5 Guys ([A-Za-z\s]*)" to every entry. It captures the group after r"5 Guys " consisting only of alphabetical characters and whitespace. If some names contain digits, this pattern will cut them short, so a more sophisticated pattern would be needed. I used an online regex helper (e.g. regex101) to build the pattern more easily.
Full code example:
import pandas as pd
import re

regex_pattern = r"5 Guys ([A-Za-z\s]*)"

def find_name(full_string):
    # Return the captured name without surrounding whitespace,
    # or None when the pattern does not match
    match = re.search(regex_pattern, full_string)
    return match[1].strip() if match else None

s = pd.Series([' 5 Guys Greasy Burger 3/5LB (24) [51656]', '5 Guys Super Strawberry Shake - (3/4) OZ (9) [5645654]', '5 Guys Giant Loaded Double Cheese Burger 1/2LB Buns - 8Z Cups (22) [564654]'])
print(s.apply(find_name))
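The same pattern could probably also be used without a Python-level function, through the vectorized `Series.str.extract`. This is only a sketch: the frame name `fg` and the column `Product` follow the question, while the new column name `Name` is a hypothetical choice so the original text is not overwritten.

```python
import pandas as pd

fg = pd.DataFrame({"Product": [
    " 5 Guys Greasy Burger 3/5LB (24) [51656]",
    "5 Guys Super Strawberry Shake - (3/4) OZ (9) [5645654]",
    "5 Guys Giant Loaded Double Cheese Burger 1/2LB Buns - 8Z Cups (22) [564654]",
]})

# Capture the run of letters and whitespace after "5 Guys ",
# then trim the trailing space left before the first non-letter character.
# "Name" is a hypothetical column name used for illustration.
fg["Name"] = fg["Product"].str.extract(r"5 Guys ([A-Za-z\s]*)", expand=False).str.strip()
print(fg["Name"].tolist())
```

With the names in their own column, something like `fg.groupby("Name")` can then be used for the per-product analysis.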
Upvotes: 3