muazfaiz

Reputation: 5021

Converting a string into a list of desired tokens using Python

I have ingredient lists for thousands of products, for example:

Ingredient = 'Beef stock (beef bones, water, onion, carrot, beef meat, parsnip, thyme, parsley, clove, black pepper, bay leaf), low lactose cream (28%), onion, mustard, modified maize starch,tomato puree, modified potato starch, butter sugar, salt (0,8%), burnt sugar, blackcurrant, peppercorns (black, pink, green, all spice, white) 0,4%.'

I want to turn this ingredient string into a list like the following:

listOfIngredients = ['Beef Stock', 'low lactose cream', 'onion', 'mustard', 'modified maize starch','tomato puree', 'modified potato starch', 'butter sugar', 'salt', 'burnt sugar', 'blackcurrant', 'peppercorns']

So listOfIngredients should contain neither the percentages nor the sub-ingredients that an ingredient itself contains. A regex seems like a good way to do this, but I am not good at writing regexes. Can someone help me write one that produces the desired output? Thanks in advance.

Upvotes: 1

Answers (1)

Wiktor Stribiżew

Reputation: 626903

You might try two approaches.

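The first one is to remove all (...) substrings together with any text that follows them up to the next separator comma, that is, a comma that is not immediately followed by a word character (so the comma in 0,4% is not treated as a separator).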

\s*\([^()]*\)[^,]*(?:,\b[^,]*)*

See the regex demo

Details:

  • \s* - 0+ whitespace characters
  • \([^()]*\) - a (...) substring with no ( or ) inside:
    • \( - a literal (
    • [^()]* - 0+ chars other than ( and ) (a [^...] is a negated character class)
    • \) - a literal )
  • [^,]* - 0+ chars other than ,
  • (?:,\b[^,]*)* - zero or more sequences of:
    • ,\b - a comma that is immediately followed by a letter/digit/underscore
    • [^,]* - 0+ chars other than ,

These matches are removed, and then the ,\s* regex is used to split the remaining string on each comma plus any whitespace that follows it, which gives the final result.
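As a minimal sketch of the remove-then-split idea, the same pattern can be written with re.VERBOSE so that each part carries its own comment. The names REMOVE_PARENS and sample are made up for this illustration, and the input is a shortened, hypothetical ingredient string:

```python
import re

# Same pattern as above, spread out with re.VERBOSE (whitespace and # comments are ignored)
REMOVE_PARENS = re.compile(r"""
    \s*               # optional whitespace before the parenthesis
    \( [^()]* \)      # a parenthesised group with no nested parentheses
    [^,]*             # trailing text up to the next comma
    (?: ,\b [^,]* )*  # any comma directly followed by a word char, plus the text after it
""", re.VERBOSE)

sample = "salt (0,8%), burnt sugar, peppercorns (black, pink) 0,4%."  # hypothetical short input

cleaned = REMOVE_PARENS.sub("", sample)   # step 1: remove the matches
tokens = re.split(r",\s*", cleaned)       # step 2: split on comma + optional whitespace
print(tokens)  # ['salt', 'burnt sugar', 'peppercorns']
```

The trailing " 0,4%." disappears because the comma in it is directly followed by a digit, so the last part of the pattern keeps consuming it.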

The second one is based on matching and capturing runs of whitespace-separated words consisting of letters (and _) only, while matching (...) substrings without capturing them so that they can be discarded.

\([^()]*\)|([^\W\d]+(?:\s+[^\W\d]+)*)

See the second regex demo

Details:

  • \([^()]*\) - a (...) substring having no ( and ) inside
  • | - or
  • ([^\W\d]+(?:\s+[^\W\d]+)*) - Group 1 capturing:
    • [^\W\d]+ - 1+ letters or underscores (you may add _ after \d to exclude underscores)
    • (?:\s+[^\W\d]+)* - 0+ sequences of:
      • \s+ - 1 or more whitespaces
      • [^\W\d]+ - 1+ letters or underscores
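To make the role of Group 1 concrete, here is a small sketch on a shortened, hypothetical input. It shows that re.findall returns an empty string wherever the (...) alternative matched, and what changes if _ is added to the negated class as suggested above. The variable names are illustrative only:

```python
import re

pattern = r'\([^()]*\)|([^\W\d]+(?:\s+[^\W\d]+)*)'
sample = "low lactose cream (28%), onion"  # hypothetical short input

# With one capturing group, findall returns Group 1 for every match;
# it is '' for matches that came from the (...) alternative.
raw = re.findall(pattern, sample)
print(raw)    # ['low lactose cream', '', 'onion']

names = [m for m in raw if m]
print(names)  # ['low lactose cream', 'onion']

# Variant that excludes underscores by adding _ to the negated class
strict = r'\([^()]*\)|([^\W\d_]+(?:\s+[^\W\d_]+)*)'
loose_result = [m for m in re.findall(pattern, "sea_salt, onion") if m]
strict_result = [m for m in re.findall(strict, "sea_salt, onion") if m]
print(loose_result)   # ['sea_salt', 'onion']
print(strict_result)  # ['sea', 'salt', 'onion']
```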

Both return the same result for the current string, but you may need to adjust them for other inputs.

See Python demo:

import re

Ingredient = 'Beef stock (beef bones, water, onion, carrot, beef meat, parsnip, thyme, parsley, clove, black pepper, bay leaf), low lactose cream (28%), onion, mustard, modified maize starch,tomato puree, modified potato starch, butter sugar, salt (0,8%), burnt sugar, blackcurrant, peppercorns (black, pink, green, all spice, white) 0,4%.'

# Approach 1: strip (...) groups and any trailing amounts, then split on commas
res = re.sub(r'\s*\([^()]*\)[^,]*(?:,\b[^,]*)*', "", Ingredient)
print(re.split(r',\s*', res))

# Approach 2: findall returns Group 1 per match, which is '' where a (...) group matched
vals = re.findall(r'\([^()]*\)|([^\W\d]+(?:\s+[^\W\d]+)*)', Ingredient)
vals = [x for x in vals if x]  # drop the empty strings
print(vals)
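As an illustration of where the two patterns can diverge, the sketch below runs both on a made-up string in which one name contains digits and another a hyphen. The input and the names REMOVE, CAPTURE, by_removal and by_capture are assumptions for this example, not part of the original data:

```python
import re

REMOVE = r'\s*\([^()]*\)[^,]*(?:,\b[^,]*)*'
CAPTURE = r'\([^()]*\)|([^\W\d]+(?:\s+[^\W\d]+)*)'

# Hypothetical input: one name contains digits, another a hyphen
sample = 'vitamin B12 (0,1%), sugar-free syrup, salt'

by_removal = re.split(r',\s*', re.sub(REMOVE, '', sample))
by_capture = [m for m in re.findall(CAPTURE, sample) if m]

print(by_removal)  # ['vitamin B12', 'sugar-free syrup', 'salt']
print(by_capture)  # ['vitamin B', 'sugar', 'free syrup', 'salt']
```

The removal-based pattern keeps whatever sits between commas, while the capturing pattern only accepts letters, underscores and whitespace, so digits and hyphens cut names short. Which behaviour is preferable depends on the real data.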

Upvotes: 1
