Reputation: 25
Hi, I am trying to extract some information from a few lines in Python with a regular expression. What I have now is: ([a-zA-Z()]+\S\S)
My lines are:
Butter 100mg x 12
Butter Organic Jelly 100mg x 7
Butter Soft 100mg x 12
3.5g Organic White Loofi
10g Bubblegum
10 x TST Butter 200yg Hofmann
100 x 10mg Jelly (Test)
With the regex above I get the strings Butter, Butter, Organic, Jelly, Butter, Soft, Organic, White, Loofi, Bubblegum, TST, Butter, Jelly, (Test). But I want one string per line, like: Butter, Butter Organic Jelly, Butter Soft, etc., not separated from each other. What am I doing wrong?
Upvotes: 0
Views: 83
Reputation: 690
You can use the following regex:
((?:(?:[a-zA-Z\(\)]{3,})+[ ]?)+)
It finds runs of words that are at least three characters long and contain only letters or parentheses (no digits), where each word may be followed by a single space.
import re
recipe = """
Butter 100mg x 12
Butter Organic Jelly 100mg x 7
Butter Soft 100mg x 12
3.5g Organic White Loofi
10g Bubblegum
10 x TST Butter 200yg Hofmann
100 x 10mg Jelly (Test)
"""
# Raw string so the backslashes reach the regex engine unchanged
pattern = re.compile(r'((?:(?:[a-zA-Z\(\)]{3,})+[ ]?)+)')
separated = pattern.findall(recipe)
print(separated)
>>> ['Butter ', 'Butter Organic Jelly ', 'Butter Soft ', 'Organic White Loofi', 'Bubblegum', 'TST Butter ', 'Hofmann', 'Jelly (Test)']
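The matches above keep their trailing spaces and come back as one flat list, so it is not obvious which match belongs to which line. A minimal sketch of one way to tidy this up, assuming the input is available as a list of lines (the names LINES and names_per_line are made up for this example):

```python
import re

# Sample lines from the question
LINES = [
    "Butter 100mg x 12",
    "Butter Organic Jelly 100mg x 7",
    "Butter Soft 100mg x 12",
    "3.5g Organic White Loofi",
    "10g Bubblegum",
    "10 x TST Butter 200yg Hofmann",
    "100 x 10mg Jelly (Test)",
]

# Same pattern as above, written as a raw string
PATTERN = re.compile(r"((?:(?:[a-zA-Z()]{3,})+[ ]?)+)")


def names_per_line(lines):
    """Return one list of matches per input line, with trailing spaces removed."""
    return [[match.strip() for match in PATTERN.findall(line)] for line in lines]


for line, names in zip(LINES, names_per_line(LINES)):
    print(f"{line!r} -> {names}")
```

Note that the line with TST still yields two separate matches (TST Butter and Hofmann), because the digits in 200yg interrupt the run of words.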
Upvotes: 0
Reputation: 3060
This regex works for your particular cases: ([A-Z][a-z][A-Za-z()\s]+[a-z)])
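What it says is, find a string where:
the first character is an uppercase letter and the second is a lowercase letter (this skips units such as mg, and for TST Butter it keeps only Butter and not TST),
then comes one or more letters, parentheses, or whitespace characters,
then the last character is a lowercase letter or a closing parenthesis, so trailing spaces are not included.
On the sample lines this gives the following matches: Butter, Butter Organic Jelly, Butter Soft, Organic White Loofi, Bubblegum, Butter, Hofmann, Jelly (Test).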
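To make the parts of this pattern easier to follow, here is a sketch of the same regex written with re.VERBOSE so each piece can carry a comment. The names LINES and extract are only illustrative, and the pattern is tried on the sample lines from the question and nothing else:

```python
import re

# Sample lines from the question
LINES = [
    "Butter 100mg x 12",
    "Butter Organic Jelly 100mg x 7",
    "Butter Soft 100mg x 12",
    "3.5g Organic White Loofi",
    "10g Bubblegum",
    "10 x TST Butter 200yg Hofmann",
    "100 x 10mg Jelly (Test)",
]

# Same regex as in this answer, spread out with comments
PATTERN = re.compile(
    r"""
    (                 # capture the whole name
      [A-Z][a-z]      # an uppercase letter followed by a lowercase letter
      [A-Za-z()\s]+   # one or more letters, parentheses, or whitespace
      [a-z)]          # must end on a lowercase letter or ')'
    )
    """,
    re.VERBOSE,
)


def extract(line):
    """Return every name the pattern finds in a single line."""
    return PATTERN.findall(line)


for line in LINES:
    print(f"{line!r} -> {extract(line)}")
```

Because the match has to start with an uppercase letter followed by a lowercase one, TST is skipped, and because it has to end on a letter or a closing parenthesis, no trailing space is captured.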
Upvotes: 1