Reputation: 5021
I have ingredient lists for thousands of products, for example:
Ingredient = 'Beef stock (beef bones, water, onion, carrot, beef meat, parsnip, thyme, parsley, clove, black pepper, bay leaf), low lactose cream (28%), onion, mustard, modified maize starch,tomato puree, modified potato starch, butter sugar, salt (0,8%), burnt sugar, blackcurrant, peppercorns (black, pink, green, all spice, white) 0,4%.'
I want this ingredient in the form of a list like the following:
listOfIngredients = ['Beef Stock', 'low lactose cream', 'onion', 'mustard', 'modified maize starch','tomato puree', 'modified potato starch', 'butter sugar', 'salt', 'burnt sugar', 'blackcurrant', 'peppercorns']
So listOfIngredients should contain neither the percentages nor the sub-ingredients that an ingredient itself contains. A regex seems like a good way to do this, but I am not good at writing them. Can someone help me write a regex that produces the desired output? Thanks in advance.
Upvotes: 1
Views: 105
Reputation: 626903
You might try two approaches.
The first one is to remove every (...) substring together with any text that follows it up to the next comma, continuing past any comma that is immediately followed by a word character (as in 0,4%):
\s*\([^()]*\)[^,]*(?:,\b[^,]*)*
See the regex demo
Details:
  \s* - 0+ whitespaces
  \([^()]*\) - a (...) substring having no ( and ) inside:
    \( - a literal (
    [^()]* - 0+ chars other than ( and ) (a [^...] is a negated character class)
  [^,]* - 0+ chars other than ,
  (?:,\b[^,]*)* - zero or more sequences of:
    ,\b - a comma that is followed with a letter/digit/underscore
    [^,]* - 0+ chars other than ,
These matches are removed, and then the ,\s* regex is used to split the string with a comma and 0+ whitespaces to get the final result.
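To see why the optional ,\b group matters, here is a minimal sketch comparing the pattern with and without it. The shortened sample string and the variable names are assumptions made for illustration only.

```python
import re

# Shortened sample (an assumption for illustration) that ends in a
# decimal-comma percentage, like the "0,4%." in the question.
sample = 'salt (0,8%), peppercorns (black, pink) 0,4%.'

full = r'\s*\([^()]*\)[^,]*(?:,\b[^,]*)*'
no_tail = r'\s*\([^()]*\)[^,]*'  # same pattern without the ,\b group

with_tail = re.split(r',\s*', re.sub(full, '', sample))
without_tail = re.split(r',\s*', re.sub(no_tail, '', sample))

print(with_tail)     # ['salt', 'peppercorns']
print(without_tail)  # ['salt', 'peppercorns', '4%.']
```

Without the group, the removal stops at the comma inside 0,4%, so the fragment 4%. is left behind and ends up as a bogus list item.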
The second one is based on matching and capturing runs of words consisting of letters (and _) only, while matching (...) substrings without capturing them, so that they are consumed and skipped:
\([^()]*\)|([^\W\d]+(?:\s+[^\W\d]+)*)
See the second regex demo
Details:
  \([^()]*\) - a (...) substring having no ( and ) inside
  | - or
  ([^\W\d]+(?:\s+[^\W\d]+)*) - Group 1 capturing:
    [^\W\d]+ - 1+ letters or underscores (you may add _ after \d to exclude underscores)
    (?:\s+[^\W\d]+)* - 0+ sequences of:
      \s+ - 1 or more whitespaces
      [^\W\d]+ - 1+ letters or underscores
Both return the same results for the current string, but you may want to adjust them in the future.
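As a rough illustration of the second approach, the sketch below shows what re.findall returns before any filtering, and what changes when _ is added to the negated class. The sample strings (including sea_salt) are made up for the example.

```python
import re

# Shortened sample (an assumption for illustration).
sample = 'Beef stock (beef bones, water), onion, salt (0,8%)'
pattern = r'\([^()]*\)|([^\W\d]+(?:\s+[^\W\d]+)*)'

# With one capturing group, findall returns Group 1 for every match,
# which is an empty string wherever the (...) alternative matched.
raw = re.findall(pattern, sample)
print(raw)  # ['Beef stock', '', 'onion', 'salt', '']

# [^\W\d] still allows underscores; [^\W\d_] excludes them.
with_underscore = re.findall(r'[^\W\d]+', 'sea_salt')
without_underscore = re.findall(r'[^\W\d_]+', 'sea_salt')
print(with_underscore)     # ['sea_salt']
print(without_underscore)  # ['sea', 'salt']
```

The empty strings are the reason the result list has to be filtered afterwards.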
See Python demo:
import re
Ingredient = 'Beef stock (beef bones, water, onion, carrot, beef meat, parsnip, thyme, parsley, clove, black pepper, bay leaf), low lactose cream (28%), onion, mustard, modified maize starch,tomato puree, modified potato starch, butter sugar, salt (0,8%), burnt sugar, blackcurrant, peppercorns (black, pink, green, all spice, white) 0,4%.'
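# Approach 1: remove each (...) group plus its trailing text, then split on commas.
res = re.sub(r'\s*\([^()]*\)[^,]*(?:,\b[^,]*)*', "", Ingredient)
print(re.split(r',\s*', res))
# Approach 2: findall returns Group 1, which is empty wherever a (...) group matched.
vals = re.findall(r'\([^()]*\)|([^\W\d]+(?:\s+[^\W\d]+)*)', Ingredient)
vals = [x for x in vals if x]  # drop the empty strings
print(vals)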
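Since the question mentions thousands of products, one possible way to reuse the first approach is sketched below. The helper split_ingredients, the products dictionary and its contents are hypothetical names and data, not part of the original question.

```python
import re

# Patterns from the first approach, compiled once for reuse.
PAREN_AND_TAIL = re.compile(r'\s*\([^()]*\)[^,]*(?:,\b[^,]*)*')
COMMA = re.compile(r',\s*')

def split_ingredients(text):
    """Hypothetical helper: return the top-level ingredient names of one string."""
    cleaned = PAREN_AND_TAIL.sub('', text)
    # Strip surrounding spaces and a trailing period, and drop empty pieces.
    parts = (part.strip(' .') for part in COMMA.split(cleaned))
    return [part for part in parts if part]

# Made-up product data for illustration.
products = {
    'sauce': 'Beef stock (beef bones, water), onion, salt (0,8%).',
    'soup': 'water, carrot (12%), cream.',
}
result = {name: split_ingredients(text) for name, text in products.items()}
print(result)
```

The extra strip(' .') is an addition here: it handles a final period when the last ingredient has no (...) group in front of it, a case that does not occur in the string from the question.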
Upvotes: 1