Prolle

Reputation: 358

Regex (Python) - Way around a quantifier with a Look-Behind?

I have a list of many elements (all strings, but unfortunately with lots of whitespace too). Here are two elements as an example:

sample_string = '8000KE60803F6                ST FULL-DEPTH TEETH            1 EA           36,56          36,56    2,00           0,73           37,29' ,'8522-3770                    CONTACT            2 EA          311,45         622,90    2,00          12,46          635,36'
my_list = list(sample_string)    

I wish to use regex to extract the first number/letter sequence (in the case of the above, that's 8000KE60803F6 and 8522-3770). I then wish to extract the next alpha sequence (in the case of the above, that's 'ST FULL-DEPTH TEETH' and 'CONTACT'). Lastly, I wish to extract the numeric value that follows the EA (in the case of the above, that's 36,56 and 311,45).

I have tried the following:

for item in my_list:
    line=re.search(r'([A-Z0-9]*)(\s*)((?<=EA\s)[\d,]*)', item)
    if line:
        PN = line.group(1)
        Name = line.group(2)
        Price = line.group(3)
    print(PN)
    print(Name)
    print(Price)

The above outputs:

EA

EA

However, I am seeking the following output:

PN: 8000KE60803F6 and 8522-3770

Name: ST FULL-DEPTH TEETH and CONTACT

Price: 36,56 and 311,45

In reality, I need to iterate through a large list.

I have also tried lookarounds, but I get the common error that occurs when a quantifier is used inside a look-behind.

Upvotes: 2

Views: 89

Answers (3)

anubhava

Reputation: 785761

You may use this regex with three named capture groups:

(?P<PN>[A-Z\d-]+)\s+(?P<Name>[A-Z]+(?:[\s-]+[A-Z]+)*)\s+[^,]+?EA\s+(?P<Price>\d+(?:,\d+)*)
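
A minimal sketch of how this pattern might be applied with re.search, assuming the two sample rows from the question. The names rx, rows and results are illustrative only and not part of the original answer:

```python
import re

# Pattern from this answer, compiled once so it can be reused across a large list.
rx = re.compile(
    r"(?P<PN>[A-Z\d-]+)\s+(?P<Name>[A-Z]+(?:[\s-]+[A-Z]+)*)\s+[^,]+?EA\s+(?P<Price>\d+(?:,\d+)*)"
)

# The two sample rows from the question.
rows = [
    "8000KE60803F6                ST FULL-DEPTH TEETH            1 EA           36,56          36,56    2,00           0,73           37,29",
    "8522-3770                    CONTACT            2 EA          311,45         622,90    2,00          12,46          635,36",
]

results = []
for row in rows:
    m = rx.search(row)
    if m:  # rows that do not fit the expected layout are skipped
        results.append((m.group("PN"), m.group("Name"), m.group("Price")))

for pn, name, price in results:
    print(pn, name, price, sep=" | ")
```

Because the pattern is not anchored with ^, re.search is used here rather than re.match semantics being relied upon; the match still starts at the first part number in each row.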


Upvotes: 2

Anakhand

Reputation: 3028

I think this is a good place to use regex groups:

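pattern = re.compile(r"^(?P<PN>[\w-]+)\s*(?P<name>[\w-]+(?: [\w-]+)*)\s*\d+\s*EA\s*(?P<price>[\d,]+)")

Notice how the groups are separated by arbitrary amounts of whitespace (\s*), how each group is named with (?P<...>), and how [\w-] lets part numbers and names contain hyphens (as in 8522-3770 and FULL-DEPTH).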

Then extracting each component is easy:

for string in my_list:
    match = pattern.match(string)
    if match is None:
        continue  # skip lines that do not fit the expected layout
    groups = match.groupdict()
    print(groups["PN"])
    print(groups["name"])
    print(groups["price"])

Upvotes: 1

Wiktor Stribiżew

Reputation: 627292

You can use

^(?P<PN>\S+)\s+(?P<Name>.*?)\s+\d+\s+EA\s+(?P<Price>\d[\d,]*)

Details:

  • ^ - start of string
  • (?P<PN>\S+) - Group PN: one or more non-whitespace chars
  • \s+ - one or more whitespace chars
  • (?P<Name>.*?) - Group Name: any zero or more chars other than line break chars, as few as possible
  • \s+\d+\s+ - one or more digits enclosed with one or more whitespace chars on each side
  • EA - an EA string
  • \s+ - one or more whitespace chars
  • (?P<Price>\d[\d,]*) - Group Price: a digit and then any zero or more digits or commas.

In Python, you can use it like this:

import re
rx = re.compile(r'^(?P<PN>\S+)\s+(?P<Name>.*?)\s+\d+\s+EA\s+(?P<Price>\d[\d,]*)')
l = ['8000KE60803F6                ST FULL-DEPTH TEETH            1 EA           36,56          36,56    2,00           0,73           37,29',
'8522-3770                    CONTACT            2 EA          311,45         622,90    2,00          12,46          635,36']
for el in l:
    m = rx.match(el)
    if m:
        print(m.groupdict())
# => {'PN': '8000KE60803F6', 'Name': 'ST FULL-DEPTH TEETH', 'Price': '36,56'}
#    {'PN': '8522-3770', 'Name': 'CONTACT', 'Price': '311,45'}

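
As for the look-behind in the question title: Python's re module only accepts fixed-width look-behinds, so a quantifier such as \s+ inside (?<=...) is rejected when the pattern is compiled. A minimal sketch of that error and of the capturing-group way around it follows; the names row, lookbehind_error and price are illustrative, and the row is a shortened copy of the second sample:

```python
import re

row = "8522-3770                    CONTACT            2 EA          311,45         622,90"

# A quantifier inside a look-behind makes its width variable, which re rejects.
try:
    re.compile(r"(?<=EA\s+)[\d,]+")
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = exc

# Matching the "EA" prefix normally and capturing only the price avoids the look-behind.
price_match = re.search(r"\bEA\s+(\d[\d,]*)", row)
price = price_match.group(1) if price_match else None

print(lookbehind_error)
print(price)
```

The same idea is what the full pattern above does: everything before the price is consumed as ordinary pattern text, and only the named groups are read back.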

Upvotes: 2
