Array

Reputation: 25

regex for line of string

Hi, I am trying to extract some information from a few lines in Python with a regular expression. What I have now is: ([a-zA-Z()]+\S\S) My lines are:

Butter 100mg x 12
Butter Organic Jelly 100mg x 7
Butter Soft 100mg x 12
3.5g Organic White Loofi
10g Bubblegum
10 x TST Butter 200yg Hofmann
100 x 10mg Jelly (Test)

With the regex above I get the strings Butter, Butter, Organic, Jelly, Butter, Soft, Organic, White, Loofi, Bubblegum, TST, Butter, Jelly, (Test). But I want the whole string from each line, such as Butter, Butter Organic Jelly, Butter Soft, etc., not separated into individual words. What am I doing wrong?

Upvotes: 0

Views: 83

Answers (2)

GLaDOS

Reputation: 690

You can use the following regex:

((?:(?:[a-zA-Z\(\)]{3,})+[ ]?)+)

It finds runs of words that are at least three characters long and contain no digits (only letters and parentheses), with the words separated by single spaces.

import re

recipe = """
Butter 100mg x 12
Butter Organic Jelly 100mg x 7
Butter Soft 100mg x 12
3.5g Organic White Loofi
10g Bubblegum
10 x TST Butter 200yg Hofmann
100 x 10mg Jelly (Test)
"""

# Raw string, so the backslashes reach the regex engine unchanged
pattern = re.compile(r'((?:(?:[a-zA-Z\(\)]{3,})+[ ]?)+)')
separated = pattern.findall(recipe)

print(separated)
# ['Butter ', 'Butter Organic Jelly ', 'Butter Soft ', 'Organic White Loofi', 'Bubblegum', 'TST Butter ', 'Hofmann', 'Jelly (Test)']
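Note that some of these matches keep a trailing space, because the optional [ ]? is consumed after the last word of a run. A minimal sketch of one way to tidy that up, assuming it is acceptable to post-process the matches with str.strip (the name cleaned is only for illustration):

```python
import re

recipe = """
Butter 100mg x 12
Butter Organic Jelly 100mg x 7
Butter Soft 100mg x 12
3.5g Organic White Loofi
10g Bubblegum
10 x TST Butter 200yg Hofmann
100 x 10mg Jelly (Test)
"""

# Same pattern as in the answer, written as a raw string.
pattern = re.compile(r'((?:(?:[a-zA-Z\(\)]{3,})+[ ]?)+)')

# Strip the trailing space that the optional [ ]? leaves on some matches.
cleaned = [match.strip() for match in pattern.findall(recipe)]

print(cleaned)
# ['Butter', 'Butter Organic Jelly', 'Butter Soft', 'Organic White Loofi', 'Bubblegum', 'TST Butter', 'Hofmann', 'Jelly (Test)']
```

TST Butter is still returned as one string here, since TST is three letters long and therefore counts as a word for this pattern.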

Upvotes: 0

Flavian Hautbois

Reputation: 3060

This regex works for your particular cases: ([A-Z][a-z][A-Za-z()\s]+[a-z)])

What it says is, find a string where:

  • the first character is an uppercase char (used to get rid of mg)
  • the second character is a lowercase char (this rejects TST, so that only Butter is kept from TST Butter)
  • then come one or more uppercase letters, lowercase letters, parentheses or whitespace characters
  • the last character is a closing parenthesis or a lowercase char.

This gives me the following matches:

  • Butter
  • Butter Organic Jelly
  • Butter Soft
  • Organic White Loofi
  • Bubblegum
  • Butter
  • Hofmann
  • Jelly (Test)
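For reference, here is a minimal sketch of how this pattern could be run in Python. It assumes each input line is matched on its own and uses re.VERBOSE only so that the parts described above can be commented; the names lines and matches are illustrative and not part of the original answer.

```python
import re

lines = [
    "Butter 100mg x 12",
    "Butter Organic Jelly 100mg x 7",
    "Butter Soft 100mg x 12",
    "3.5g Organic White Loofi",
    "10g Bubblegum",
    "10 x TST Butter 200yg Hofmann",
    "100 x 10mg Jelly (Test)",
]

# The same pattern as above, spread out with re.VERBOSE so each part can be commented.
pattern = re.compile(r"""
    (
        [A-Z]            # first character: an uppercase letter
        [a-z]            # second character: a lowercase letter
        [A-Za-z()\s]+    # one or more letters, parentheses or whitespace characters
        [a-z)]           # last character: a lowercase letter or a closing parenthesis
    )
""", re.VERBOSE)

# Collect the matches from every line into one flat list.
matches = [m for line in lines for m in pattern.findall(line)]

for m in matches:
    print(m)
```

Because the pattern needs at least four characters (uppercase, lowercase, one or more of the middle class, then the final character), shorter names would not match; that is fine for the sample lines but worth keeping in mind for other data.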

Upvotes: 1
