Reputation: 45
I have a string that contains some terribly formatted ingredients list (shortened for this example):
Vitamin A 6,000iu/kg, Vitamin D3 80iu/kg, Vitamin E PLUS 240iu/kg
I want to break the string down into a list of entries in the format [label, amount, units]:
[[Vitamin A, 6000, iu/kg], ...]
The problem is that commas appear both inside the numbers (e.g. 6,000) and as separators, so I can't simply split on commas. The labels can be any mix of letters and numbers (e.g. Super Duper Vitamin c4390 4.5iu/kg), which makes distinguishing labels from amounts even harder. The units vary, e.g. mg/kg and iu/kg. The list is not restricted to Vitamin ingredients; it can contain other words such as Potassium. The amounts can also be decimals.
There are also edge cases: the space after a comma may be missing, and some ingredients may be missing the comma separator altogether:
Vitamin A 6,000iu/kg,Vitamin D3 80iu/kg Vitamin E PLUS 240iu/kg, Potassium 3.2mg/kg
The best I can do is this regex:
^(([a-zA-Z]*\s+)*(\d+[,]?\d+)([a-z\/]*))
It does not quite match the first ingredient, and it handles none of the edge cases. How can I extract the data I want from this messy string?
Upvotes: 1
Views: 328
Reputation: 174796
You could use the regex below. The \d+(?:\.\d+)? part matches numbers with an optional decimal fraction.
(?:^|,)?\s*(.*?)\s+((?:\d+,)?\d+(?:\.\d+)?)([a-z\/]*)
>>> import re
>>> s = "Vitamin A 15,000iu/kg,Vitamin D3 2,000iu/kg, Vitamin E 200iu/kg,Zinc Sulphate Monohydrate 417mg/kg, Manganous Oxide 131mg/kg,Ferrous Sulphate Monohydrate 297mg/kg, Calcium Iodate Anhydrous 7.9mg/kg, Sodium Molybdate 6.4mg/kg,Cupric Sulphate Pentahydrate 2.4mg/kg Sodium Selenite 0.2mg/kg"
>>> re.findall(r'(?:^|,)?\s*(.*?)\s+((?:\d+,)?\d+(?:\.\d+)?)([a-z\/]*)', s)
[('Vitamin A', '15,000', 'iu/kg'), ('Vitamin D3', '2,000', 'iu/kg'), ('Vitamin E', '200', 'iu/kg'), ('Zinc Sulphate Monohydrate', '417', 'mg/kg'), ('Manganous Oxide', '131', 'mg/kg'), ('Ferrous Sulphate Monohydrate', '297', 'mg/kg'), ('Calcium Iodate Anhydrous', '7.9', 'mg/kg'), ('Sodium Molybdate', '6.4', 'mg/kg'), ('Cupric Sulphate Pentahydrate', '2.4', 'mg/kg'), ('Sodium Selenite', '0.2', 'mg/kg')]
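The question asks for numeric amounts (6000 rather than '6,000'), so the tuples returned by re.findall still need a small post-processing step. Below is a minimal sketch of that step, assuming every comma inside an amount is a thousands separator. The names to_number and parse_ingredients are hypothetical helpers introduced for illustration, not part of the answer above. Note that the pattern as written allows only one comma group per amount, so a value such as 1,000,000 would not be captured whole.

```python
import re

# Pattern from the answer above, unchanged.
PATTERN = re.compile(r'(?:^|,)?\s*(.*?)\s+((?:\d+,)?\d+(?:\.\d+)?)([a-z\/]*)')

def to_number(text):
    """Drop thousands separators and return an int or a float (assumed convention)."""
    cleaned = text.replace(',', '')
    return float(cleaned) if '.' in cleaned else int(cleaned)

def parse_ingredients(s):
    """Return a list of [label, amount, units] lists with numeric amounts."""
    return [[label, to_number(amount), units]
            for label, amount, units in PATTERN.findall(s)]

# Edge-case string from the question: missing space and missing comma.
sample = ("Vitamin A 6,000iu/kg,Vitamin D3 80iu/kg Vitamin E PLUS 240iu/kg, "
          "Potassium 3.2mg/kg")
result = parse_ingredients(sample)
print(result)
```

Keeping the conversion in a separate function means the regex only has to locate the fields, and the number handling can be changed without touching the pattern.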
Upvotes: 0
Reputation: 215019
For complex regexes I'd suggest that you use extended mode, multiline literals and named groups. This greatly improves readability. Example:
rx = r"""(?x)
    \s*
    (?P<label>
        [^,]+?
    )
    \s+
    (?P<amount>
        \d\d? (,\d{3})* (\.\d\d?)?
        |
        \d+
    )
    \s*
    (?P<unit>
        [a-z][a-z]?
        /
        [a-z][a-z]?
    )
"""
Usage:
import re
s = "Vitamin A 6,000iu/kg,Vitamin D3 80iu/kg, Vitamin E PLUS 240iu/kg,W E I R D66 66,666 x/cm"
for x in re.finditer(rx, s):
    print(x.groupdict())
Result (Python 3.7+, where dictionaries keep insertion order):
{'label': 'Vitamin A', 'amount': '6,000', 'unit': 'iu/kg'}
{'label': 'Vitamin D3', 'amount': '80', 'unit': 'iu/kg'}
{'label': 'Vitamin E PLUS', 'amount': '240', 'unit': 'iu/kg'}
{'label': 'W E I R D66', 'amount': '66,666', 'unit': 'x/cm'}
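Extended mode also allows # comments inside the pattern, which is where most of the readability gain comes from. The sketch below is the same pattern compiled with the re.VERBOSE flag instead of the inline (?x), with comments describing one reading of each part. It is run against the edge-case string from the question (missing space after a comma, one missing comma, a decimal amount) to show how it behaves there. The variable name rows is only illustrative.

```python
import re

rx = re.compile(r"""
    \s*
    (?P<label>  [^,]+? )              # shortest run of non-comma characters
    \s+
    (?P<amount>
        \d\d? (,\d{3})* (\.\d\d?)?    # 1-2 digits, optional ,ddd groups, optional decimals
      |
        \d+                           # or a plain run of digits
    )
    \s*
    (?P<unit>  [a-z][a-z]? / [a-z][a-z]? )   # one or two letters on each side of the slash
""", re.VERBOSE)

# Edge-case string from the question.
s = ("Vitamin A 6,000iu/kg,Vitamin D3 80iu/kg Vitamin E PLUS 240iu/kg, "
     "Potassium 3.2mg/kg")

# Named groups let each field be pulled out by name rather than by position.
rows = [[m.group('label'), m.group('amount'), m.group('unit')]
        for m in rx.finditer(s)]
print(rows)
```

Because the label is defined as "anything but a comma", an ingredient with a missing comma separator is still picked up: the search simply resumes after the previous unit.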
Upvotes: 1
Reputation: 67988
([a-zA-Z\s0-9]+)\s+([\d,.]+)([^, ]+)(?=,|$|\s)
Try this. You don't need to split; use re.findall instead. See the demo:
http://regex101.com/r/yR3mM3/32
import re
p = re.compile(r'([a-zA-Z\s0-9]+)\s+([\d,.]+)([^, ]+)(?=,|$|\s)', re.MULTILINE | re.IGNORECASE)
test_str = "Vitamin A 6,000iu/kg, Vitamin D3 80iu/kg, Vitamin E PLUS 240iu/k"
re.findall(p, test_str)
Output: [('Vitamin A', '6,000', 'iu/kg'), (' Vitamin D3', '80', 'iu/kg'), (' Vitamin E PLUS', '240', 'iu/k')]
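Note that the labels in this output keep the space that followed the previous comma, because \s is part of the label's character class. A minimal sketch of one way to tidy that up, assuming leading and trailing whitespace is never a meaningful part of a label; the name cleaned is hypothetical:

```python
import re

# Same pattern and flags as above.
p = re.compile(r'([a-zA-Z\s0-9]+)\s+([\d,.]+)([^, ]+)(?=,|$|\s)',
               re.MULTILINE | re.IGNORECASE)
test_str = "Vitamin A 6,000iu/kg, Vitamin D3 80iu/kg, Vitamin E PLUS 240iu/kg"

# Strip surrounding whitespace from each label; amount and unit are left untouched.
cleaned = [(label.strip(), amount, unit)
           for label, amount, unit in p.findall(test_str)]
print(cleaned)
```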
Upvotes: 3