mubular

Reputation: 45

python regex parsing data from extremely complicated ingredients string

I have a string that contains a terribly formatted ingredients list (shortened for this example):

Vitamin A 6,000iu/kg, Vitamin D3 80iu/kg, Vitamin E PLUS 240iu/kg

I want to break the string down into a list containing the format [label, amount, units]:

[[Vitamin A, 6000, iu/kg], ...]

The problem is that commas appear both inside the numbers (e.g., 6,000) and as the separator, so I can't simply split on commas. The labels can be any mix of letters and numbers (e.g., Super Duper Vitamin c4390 4.5iu/kg), which makes distinguishing labels from amounts even harder. The units vary between mg/kg and iu/kg. The list is not restricted to vitamins; it can contain other words like Potassium. There are decimals as well.

There are also edge cases: a space may be missing after a comma, and some ingredients may be missing the comma separator altogether:

Vitamin A 6,000iu/kg,Vitamin D3 80iu/kg Vitamin E PLUS 240iu/kg, Potassium 3.2mg/kg

The best I can do is this regex:

^(([a-zA-Z]*\s+)*(\d+[,]?\d+)([a-z\/]*))

Which (not quite) matches the first ingredient, and it doesn't handle any of the edge cases. How can I extract the data I want from this messy string?

EDIT:

Here is a real example

Upvotes: 1

Views: 328

Answers (3)

Avinash Raj

Reputation: 174796

You could use the regex below. The \d+(?:\.\d+)? part matches a number with an optional decimal fraction.

(?:^|,)?\s*(.*?)\s+((?:\d+,)?\d+(?:\.\d+)?)([a-z\/]*)


>>> import re
>>> s = "Vitamin A 15,000iu/kg,Vitamin D3 2,000iu/kg, Vitamin E 200iu/kg,Zinc Sulphate Monohydrate 417mg/kg, Manganous Oxide 131mg/kg,Ferrous Sulphate Monohydrate 297mg/kg, Calcium Iodate Anhydrous 7.9mg/kg, Sodium Molybdate 6.4mg/kg,Cupric Sulphate Pentahydrate 2.4mg/kg Sodium Selenite 0.2mg/kg"
>>> re.findall(r'(?:^|,)?\s*(.*?)\s+((?:\d+,)?\d+(?:\.\d+)?)([a-z\/]*)', s)
[('Vitamin A', '15,000', 'iu/kg'), ('Vitamin D3', '2,000', 'iu/kg'), ('Vitamin E', '200', 'iu/kg'), ('Zinc Sulphate Monohydrate', '417', 'mg/kg'), ('Manganous Oxide', '131', 'mg/kg'), ('Ferrous Sulphate Monohydrate', '297', 'mg/kg'), ('Calcium Iodate Anhydrous', '7.9', 'mg/kg'), ('Sodium Molybdate', '6.4', 'mg/kg'), ('Cupric Sulphate Pentahydrate', '2.4', 'mg/kg'), ('Sodium Selenite', '0.2', 'mg/kg')]
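The tuples above still carry the amounts as strings such as '15,000', whereas the question asks for plain numbers like 6000. Here is a minimal sketch of that remaining step, assuming the regex from this answer is used unchanged. The helper names to_number and parse_ingredients are made up for illustration and are not part of the original answer.

```python
import re

# Regex copied from the answer above.
PATTERN = re.compile(r'(?:^|,)?\s*(.*?)\s+((?:\d+,)?\d+(?:\.\d+)?)([a-z\/]*)')


def to_number(text):
    """Drop thousands separators, then return an int or a float."""
    cleaned = text.replace(',', '')
    return float(cleaned) if '.' in cleaned else int(cleaned)


def parse_ingredients(text):
    """Return a list of [label, amount, unit] lists (hypothetical helper)."""
    return [[label, to_number(amount), unit]
            for label, amount, unit in PATTERN.findall(text)]


# Edge-case string from the question: one missing space, one missing comma.
sample = ("Vitamin A 6,000iu/kg,Vitamin D3 80iu/kg "
          "Vitamin E PLUS 240iu/kg, Potassium 3.2mg/kg")
for row in parse_ingredients(sample):
    print(row)
```

Returning int for whole amounts and float for decimals is just one possible choice; using float everywhere would work as well if a uniform type is easier downstream.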

Upvotes: 0

georg

Reputation: 215019

For complex regexes I'd suggest that you use extended mode, multiline literals and named groups. This greatly improves readability. Example:

rx = r"""(?x)

    \s*

    (?P<label>
        [^,]+?
    )

    \s+

    (?P<amount>
        \d\d? (,\d{3})* (\.\d\d?)?
        |
        \d+
    )

    \s*

    (?P<unit>
        [a-z][a-z]?
        /
        [a-z][a-z]?
    )
"""

Usage:

import re

s = "Vitamin A 6,000iu/kg,Vitamin D3 80iu/kg,     Vitamin E PLUS 240iu/kg,W E I R D66      66,666 x/cm"

for x in re.finditer(rx, s):
    print(x.groupdict())

Result:

{'label': 'Vitamin A', 'amount': '6,000', 'unit': 'iu/kg'}
{'label': 'Vitamin D3', 'amount': '80', 'unit': 'iu/kg'}
{'label': 'Vitamin E PLUS', 'amount': '240', 'unit': 'iu/kg'}
{'label': 'W E I R D66', 'amount': '66,666', 'unit': 'x/cm'}
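With input this messy, it may also help to know what the pattern did not match, since re.finditer silently skips over text it cannot parse. The following is only a sketch of one way to collect such leftovers. It assumes the same pattern as above, written more compactly and compiled with re.VERBOSE instead of the inline (?x). The function name parse_with_leftovers and the "Mystery Blend" input are invented for the example.

```python
import re

# Same pattern as in the answer above, compacted onto fewer lines.
RX = re.compile(r"""
    \s*
    (?P<label>  [^,]+? )
    \s+
    (?P<amount> \d\d? (,\d{3})* (\.\d\d?)? | \d+ )
    \s*
    (?P<unit>   [a-z][a-z]? / [a-z][a-z]? )
""", re.VERBOSE)


def parse_with_leftovers(text):
    """Return (items, leftovers); leftovers is the text no match covered."""
    items, leftovers = [], []
    pos = 0  # end of the previous match
    for m in RX.finditer(text):
        # Anything between two matches other than commas/spaces was skipped.
        gap = text[pos:m.start()].strip(" ,")
        if gap:
            leftovers.append(gap)
        items.append(m.groupdict())
        pos = m.end()
    tail = text[pos:].strip(" ,")
    if tail:
        leftovers.append(tail)
    return items, leftovers


# Hypothetical input with one entry that has no amount or unit.
items, leftovers = parse_with_leftovers(
    "Vitamin A 6,000iu/kg, Mystery Blend, Zinc 4mg/kg")
print(items)
print(leftovers)
```

Because the label group cannot cross a comma, this only catches unparsed entries that are delimited by commas. Stray text without a comma would be absorbed into the next label instead.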

Upvotes: 1

vks

Reputation: 67988

([a-zA-Z\s0-9]+)\s+([\d,.]+)([^, ]+)(?=,|$|\s)

Try this. You don't need to split; use re.findall instead. See the demo:

http://regex101.com/r/yR3mM3/32

import re
p = re.compile(r'([a-zA-Z\s0-9]+)\s+([\d,.]+)([^, ]+)(?=,|$|\s)', re.MULTILINE | re.IGNORECASE)
test_str = "Vitamin A 6,000iu/kg, Vitamin D3 80iu/kg, Vitamin E PLUS 240iu/k"

re.findall(p, test_str)

Output: [('Vitamin A', '6,000', 'iu/kg'), (' Vitamin D3', '80', 'iu/kg'), (' Vitamin E PLUS', '240', 'iu/k')]
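Note that the labels in the output keep the leading space left over from the separator. Below is a small sketch of tidying that up, run against the edge-case string from the question (missing space after one comma, one comma missing entirely). The names edge_case and rows are only illustrative.

```python
import re

# Pattern from this answer; the flags are omitted because the character
# classes already cover both cases and $ is only needed at the very end.
p = re.compile(r'([a-zA-Z\s0-9]+)\s+([\d,.]+)([^, ]+)(?=,|$|\s)')

edge_case = ("Vitamin A 6,000iu/kg,Vitamin D3 80iu/kg "
             "Vitamin E PLUS 240iu/kg, Potassium 3.2mg/kg")

# Strip stray whitespace around each label and build [label, amount, unit].
rows = [[label.strip(), amount, unit]
        for label, amount, unit in p.findall(edge_case)]

for row in rows:
    print(row)
```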

Upvotes: 3
