Reputation: 358
I have a list of many elements (all strings, but unfortunately with lots of whitespace too). Here are two elements as an example:
sample_string = '8000KE60803F6 ST FULL-DEPTH TEETH 1 EA 36,56 36,56 2,00 0,73 37,29' ,'8522-3770 CONTACT 2 EA 311,45 622,90 2,00 12,46 635,36'
my_list = list(sample_string)
I wish to use regex to extract the first number/letter sequence (in the case of the above, that's 8000KE60803F6 and 8522-3770). I then wish to extract the next alpha sequence (in the case of the above, that's 'ST FULL-DEPTH TEETH' and 'CONTACT'). Lastly, I wish to extract the numeric value that follows the EA (in the case of the above, that's 36,56 and 311,45).
I have tried the following:
for item in my_list:
    line = re.search(r'([A-Z0-9]*)(\s*)((?<=EA\s)[\d,]*)', item)
    if line:
        PN = line.group(1)
        Name = line.group(2)
        Price = line.group(3)
        print(PN)
        print(Name)
        print(Price)
The above outputs
EA
EA
However, I am seeking the following output:
PN: 8000KE60803F6 and 8522-3770
Name: ST FULL-DEPTH TEETH and CONTACT
Price: 36,56 and 311,45
In reality, I need to iterate through a large list.
I have also tried lookarounds, but I get the common error that is raised when a quantifier is used inside them.
Upvotes: 2
Views: 89
Reputation: 785761
You may use this regex with 3 captured groups:
(?P<PN>[A-Z\d-]+)\s+(?P<Name>[A-Z]+(?:[\s-]+[A-Z]+)*)\s+[^,]+?EA\s+(?P<Price>\d+(?:,\d+)*)
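As a minimal sketch of how this pattern might be applied in Python (the names rx and rows are illustrative and not part of the answer; the sample data is copied from the question), the three named groups can be collected per element and then printed in the format the question asks for:

```python
import re

# Pattern from this answer, compiled once so it can be reused for a large list.
rx = re.compile(
    r'(?P<PN>[A-Z\d-]+)\s+(?P<Name>[A-Z]+(?:[\s-]+[A-Z]+)*)\s+'
    r'[^,]+?EA\s+(?P<Price>\d+(?:,\d+)*)'
)

# Sample data copied from the question.
my_list = [
    '8000KE60803F6 ST FULL-DEPTH TEETH 1 EA 36,56 36,56 2,00 0,73 37,29',
    '8522-3770 CONTACT 2 EA 311,45 622,90 2,00 12,46 635,36',
]

# One dict per matching element; elements that do not match are skipped.
rows = [m.groupdict() for m in map(rx.search, my_list) if m]

for field in ('PN', 'Name', 'Price'):
    print(f"{field}: {' and '.join(row[field] for row in rows)}")
# PN: 8000KE60803F6 and 8522-3770
# Name: ST FULL-DEPTH TEETH and CONTACT
# Price: 36,56 and 311,45
```

Filtering on the match object means an element that does not fit the pattern is skipped instead of raising an AttributeError.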
Upvotes: 2
Reputation: 3028
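I think this is a good place to use named regex groups. The character classes below include a hyphen, because both the part numbers and the names in the sample contain one:
pattern = re.compile(r"^(?P<PN>[\w-]+)\s*(?P<name>[\w-]+(?: [\w-]+)*?)\s*\d+\s*EA\s*(?P<price>[\d,]+)")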
Notice how each group is separated by arbitrarily many spaces (\s*), and how we name each group (?P<...>).
Then extracting each component is easy:
for string in my_list:
    match = pattern.match(string)
    if match:  # skip elements that do not fit the pattern
        groups = match.groupdict()
        print(groups["PN"])
        print(groups["name"])
        print(groups["price"])
Upvotes: 1
Reputation: 627292
You can use
^(?P<PN>\S+)\s+(?P<Name>.*?)\s+\d+\s+EA\s+(?P<Price>\d[\d,]*)
See the regex demo. Details:
^ - start of string
(?P<PN>\S+) - Group PN: one or more non-whitespace chars
\s+ - one or more whitespaces
(?P<Name>.*?) - Group Name: any zero or more chars other than line break chars, as few as possible
\s+\d+\s+ - one or more digits enclosed with one or more whitespaces
EA - an EA string
\s+ - one or more whitespaces
(?P<Price>\d[\d,]*) - Group Price: a digit and then any zero or more digits or commas.

In Python, you can use it like
import re

rx = re.compile(r'^(?P<PN>\S+)\s+(?P<Name>.*?)\s+\d+\s+EA\s+(?P<Price>\d[\d,]*)')
l = ['8000KE60803F6 ST FULL-DEPTH TEETH 1 EA 36,56 36,56 2,00 0,73 37,29',
     '8522-3770 CONTACT 2 EA 311,45 622,90 2,00 12,46 635,36']
for el in l:
    m = rx.match(el)
    if m:
        print(m.groupdict())
# => {'PN': '8000KE60803F6', 'Name': 'ST FULL-DEPTH TEETH', 'Price': '36,56'}
#    {'PN': '8522-3770', 'Name': 'CONTACT', 'Price': '311,45'}
See the Python demo.
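For a large list, the breakdown above can also be kept next to the pattern itself with re.VERBOSE. The following is only a sketch: the names rx_verbose, lines and parsed are illustrative, and the whitespace-only and indented elements are assumed examples of the stray whitespace mentioned in the question, not real data.

```python
import re

# Same pattern as in this answer, laid out with re.VERBOSE so that each part
# carries its own comment. Unescaped whitespace in the pattern is ignored.
rx_verbose = re.compile(r'''
    ^(?P<PN>\S+)            # part number: first run of non-whitespace chars
    \s+
    (?P<Name>.*?)           # name: as few chars as possible
    \s+\d+\s+               # quantity: digits surrounded by whitespace
    EA                      # the literal unit marker
    \s+
    (?P<Price>\d[\d,]*)     # price: a digit, then digits or commas
''', re.VERBOSE)

lines = [
    '8000KE60803F6 ST FULL-DEPTH TEETH 1 EA 36,56 36,56 2,00 0,73 37,29',
    '   ',  # assumed: a whitespace-only element
    '  8522-3770 CONTACT 2 EA 311,45 622,90 2,00 12,46 635,36',  # assumed: leading spaces
]

# strip() removes surrounding whitespace so that ^ lines up with the part
# number; elements that still do not match are dropped.
parsed = [m.groupdict() for m in (rx_verbose.match(s.strip()) for s in lines) if m]
print(parsed)
```

Stripping each element before matching is a design choice for this sketch; if the real data never has leading whitespace, the strip() call can be dropped.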
Upvotes: 2