QM.py

Reputation: 652

Using a regular expression to extract hyphen-separated parts

I'm new to regular expressions, but I want to match a pattern in about 2 million strings. The original strings come in three forms, shown below:

EC-2A-07<EC-1D-10>
EC-2-07
T1-ZJF-4

I want to get the three substrings separated by -, that is, EC, 2A, and 07 respectively. For the first string in particular, I only want to split the part before <.

I tried .+[\d]\W, but it does not match EC-2-07. I then used .split('-') to split the string and indexed into the returned list to get what I want, but that is inefficient.
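For reference, the split-based approach described above might look roughly like the following minimal sketch. The helper name `split_parts` is hypothetical, and the sketch assumes that everything from the first < onward can be discarded:

```python
def split_parts(s):
    """Return the first three hyphen-separated parts before any '<', or None."""
    # Keep only the text before the first '<' (the whole string if there is none).
    head = s.partition('<')[0]
    parts = head.split('-')
    # Require at least three parts; ignore anything beyond the third.
    return tuple(parts[:3]) if len(parts) >= 3 else None

samples = ["EC-2A-07<EC-1D-10>", "EC-2-07", "T1-ZJF-4"]
results = [split_parts(s) for s in samples]
print(results)
```

Unlike a regex, this does not validate the characters in each part, so it accepts any hyphen-separated text.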

Can you suggest an efficient regular expression that meets my requirements? Thanks a lot!

Upvotes: 0

Views: 89

Answers (2)

Wiktor Stribiżew

Reputation: 627082

You need to use

^([A-Z0-9]{2})-([A-Z0-9]{1,3})-([A-Z0-9]{1,2})

See the regex demo

Details:

  • ^ - start of string
  • ([A-Z0-9]{2}) - Group 1 capturing 2 uppercase ASCII letters or digits
  • - - a hyphen
  • ([A-Z0-9]{1,3}) - Group 2 capturing 1 to 3 uppercase ASCII letters or digits
  • - - a hyphen
  • ([A-Z0-9]{1,2}) - Group 3 capturing 1 to 2 uppercase ASCII letters or digits.

You may adjust the values in the {min,max} quantifiers as required.

Sample Python demo:

import re
regex = r"^([A-Z0-9]{2})-([A-Z0-9]{1,3})-([A-Z0-9]{1,2})"
test_str = "EC-2A-07<EC-1D-10>\nEC-2-07\nT1-ZJF-4"
matches = re.findall(regex, test_str, re.MULTILINE)
print(matches)
# or line by line (match() only matches at the start of the string, so ^ is not needed)
lines = test_str.split('\n')
rx = re.compile(r"([A-Z0-9]{2})-([A-Z0-9]{1,3})-([A-Z0-9]{1,2})")
for line in lines:
    m = rx.match(line)
    if m:
        print('{0} :: {1} :: {2}'.format(m.group(1), m.group(2), m.group(3)))
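Since the question mentions about 2 million strings, the line-by-line variant could be wrapped in a generator along these lines. This is only a sketch: `extract_all` and `RX` are hypothetical names, and it assumes the strings are already available as an iterable. The pattern is compiled once and non-matching strings yield None so they can be inspected later:

```python
import re

# Compile once, outside the loop, and reuse for every string.
RX = re.compile(r"([A-Z0-9]{2})-([A-Z0-9]{1,3})-([A-Z0-9]{1,2})")

def extract_all(strings):
    """Yield a 3-tuple of groups for each string, or None when it does not match."""
    match = RX.match  # bind the method once to avoid repeated attribute lookups
    for s in strings:
        m = match(s)
        yield m.groups() if m else None

data = ["EC-2A-07<EC-1D-10>", "EC-2-07", "T1-ZJF-4", "bad"]
parsed = list(extract_all(data))
print(parsed)
```

Whether this is actually faster than split on a given data set is best checked by timing both on a sample of the real strings.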

Upvotes: 1

Mustofa Rizwan

Reputation: 10466

You can try this:

^(\w+)-(\w+)-(\w+)(?=\W).*$
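A short sketch of how this pattern behaves on the sample strings from the question, using re.findall with re.MULTILINE (that flag is an assumption here, not part of the answer). Note that \w also accepts lowercase letters and underscores, and that the lookahead (?=\W) requires a non-word character after the third part, so a line that ends the input with no trailing character is not matched. The `relaxed` variant without the lookahead is an illustrative alternative, not part of the original answer:

```python
import re

test_str = "EC-2A-07<EC-1D-10>\nEC-2-07\nT1-ZJF-4"

# Pattern as posted: the third part must be followed by a non-word character.
posted = re.findall(r"^(\w+)-(\w+)-(\w+)(?=\W).*$", test_str, re.MULTILINE)

# Illustrative variant (an assumption): drop the lookahead, since the greedy
# \w+ already stops at the first non-word character or at the end of the input.
relaxed = re.findall(r"^(\w+)-(\w+)-(\w+)", test_str, re.MULTILINE)

print(posted)   # the last line is missing because nothing follows the "4"
print(relaxed)  # all three lines are captured
```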

Explanation

Python Demo

Upvotes: 0
