Converting a string into a list of desired tokens using Python

Question

I have ingredients for thousands of products for example:

Ingredient = 'Beef stock (beef bones, water, onion, carrot, beef meat, parsnip, thyme, parsley, clove, black pepper, bay leaf), low lactose cream (28%), onion, mustard, modified maize starch,tomato puree, modified potato starch, butter sugar, salt (0,8%), burnt sugar, blackcurrant, peppercorns (black, pink, green, all spice, white) 0,4%.'

I want this ingredient in the form of a list like the following:

listOfIngredients = ['Beef Stock', 'low lactose cream', 'onion', 'mustard', 'modified maize starch','tomato puree', 'modified potato starch', 'butter sugar', 'salt', 'burnt sugar', 'blackcurrant', 'peppercorns']

So listOfIngredients should contain neither the percentages nor the sub-ingredients that an ingredient itself contains. Regex seems like a good way of doing this, but I am not good at writing regexes. Can someone help me write a regex that produces the desired output? Thanks in advance.

Wiktor Stribiżew · Accepted Answer

You might try two approaches.

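The first one is to remove all (...) substrings together with whatever follows them up to the next ingredient separator. A comma that is immediately followed by a word character (as in 0,4%) is not treated as a separator and is removed too.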

\s*\([^()]*\)[^,]*(?:,\b[^,]*)*

See the regex demo

Details:

\s* - 0+ whitespaces
\([^()]*\) - a (...) substring having no ( and ) inside:
- \( - a literal (
- [^()]* - 0+ chars other than ( and ) (a [^...] is a negated character class)
- \) - a literal )
[^,]* - 0+ chars other than ,
(?:,\b[^,]*)* - zero or more sequences of:
- ,\b - a comma that is followed with a letter/digit/underscore
- [^,]* - 0+ chars other than ,.

These matches are removed, and then ,\s* regex is used to split the string with a comma and 0+ whitespaces to get the final result.
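As a minimal sketch of the remove-then-split steps described above, the same pattern can be written with re.VERBOSE so that each part carries its own comment. The helper name split_ingredients and the shortened sample string are assumptions made for illustration, not part of the original answer.

```python
import re

# The first pattern, annotated part by part (whitespace is ignored in VERBOSE mode).
PAREN_AND_TAIL = re.compile(r"""
    \s*                # 0+ whitespaces before the parenthesis
    \( [^()]* \)       # a (...) substring with no nested parentheses
    [^,]*              # 0+ chars other than a comma
    (?: ,\b [^,]* )*   # commas directly followed by a word char, as in 0,4
""", re.VERBOSE)


def split_ingredients(text):
    """Hypothetical helper: strip (...) details, then split on commas."""
    cleaned = PAREN_AND_TAIL.sub("", text)
    return re.split(r",\s*", cleaned)


# Shortened sample, assumed for illustration only.
sample = "salt (0,8%), burnt sugar, peppercorns (black, pink) 0,4%."
print(split_ingredients(sample))
```

Note that text is only stripped when it follows a (...) group, so a bare percentage with no parentheses before it would survive the cleanup.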

The second one is based on matching and capturing runs of words consisting of letters (and _) only, while (...) substrings are matched without being captured, so they are consumed and skipped.

\([^()]*\)|([^\W\d]+(?:\s+[^\W\d]+)*)

See the second regex demo

Details:

\([^()]*\) - a (...) substring having no ( and ) inside
| - or
([^\W\d]+(?:\s+[^\W\d]+)*) - Group 1 capturing:
- [^\W\d]+ - 1+ letters or underscores (you may add _ after \d to exclude underscores)
- (?:\s+[^\W\d]+)* - 0+ sequences of:
  - \s+ - 1 or more whitespaces
  - [^\W\d]+ - 1+ letters or underscores
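A small sketch of why the matches need filtering afterwards: when a pattern has exactly one capturing group, re.findall returns the contents of that group for every match, so each (...) substring matched by the first alternative shows up as an empty string. The short sample string is an assumption made for illustration.

```python
import re

pattern = re.compile(r"\([^()]*\)|([^\W\d]+(?:\s+[^\W\d]+)*)")

# Shortened sample, assumed for illustration only.
sample = "low lactose cream (28%), onion"

raw = pattern.findall(sample)
print(raw)    # the (28%) match contributes an empty string

names = [m for m in raw if m]  # keep only non-empty Group 1 values
print(names)
```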

Both return the same results for the current string, but you may want to adjust them in the future.
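To illustrate where the two approaches can diverge, here is a sketch on a made-up string (not from the question) containing a digit and a hyphen. The first approach keeps everything between the separating commas, while the second only captures letters and whitespace, so it cuts names at digits and hyphens.

```python
import re

# Hypothetical input, not from the question: it contains a digit and a hyphen.
sample = "vitamin B12, sugar-free syrup (3%), salt"

# Approach 1: remove (...) and trailing details, then split on commas.
cleaned = re.sub(r"\s*\([^()]*\)[^,]*(?:,\b[^,]*)*", "", sample)
by_removal = re.split(r",\s*", cleaned)

# Approach 2: capture runs of letter-only words, skip (...) substrings.
found = re.findall(r"\([^()]*\)|([^\W\d]+(?:\s+[^\W\d]+)*)", sample)
by_matching = [m for m in found if m]

print(by_removal)   # ['vitamin B12', 'sugar-free syrup', 'salt']
print(by_matching)  # ['vitamin B', 'sugar', 'free syrup', 'salt']
```

If ingredient names in your data may contain digits or hyphens, the first approach is probably the safer starting point, or the character class in the second one would need extending.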

See Python demo:

import re
Ingredient = 'Beef stock (beef bones, water, onion, carrot, beef meat, parsnip, thyme, parsley, clove, black pepper, bay leaf), low lactose cream (28%), onion, mustard, modified maize starch,tomato puree, modified potato starch, butter sugar, salt (0,8%), burnt sugar, blackcurrant, peppercorns (black, pink, green, all spice, white) 0,4%.'

# Approach 1: remove (...) substrings and trailing details, then split on commas
res = re.sub(r'\s*\([^()]*\)[^,]*(?:,\b[^,]*)*', "", Ingredient)
print(re.split(r',\s*', res))

# Approach 2: capture letter-only word runs; (...) matches yield empty strings
vals = re.findall(r'\([^()]*\)|([^\W\d]+(?:\s+[^\W\d]+)*)', Ingredient)
vals = [x for x in vals if x]  # drop the empty strings
print(vals)
