Reputation: 652
I'm new to regular expressions, but I want to match a pattern in about 2 million strings. The original strings come in three forms:
EC-2A-07<EC-1D-10>
EC-2-07
T1-ZJF-4
I want to get the three substrings separated by -, that is, EC, 2A, and 07 respectively. For the first string, I only want to split the part before <.
I have tried .+[\d]\W, but it does not match EC-2-07. I then used .split('-') to split the string and indexed into the returned list to get what I want, but that is inefficient.
Can you suggest an efficient regular expression that meets my requirements? Thanks a lot!
Upvotes: 0
Views: 89
Reputation: 627082
You need to use
^([A-Z0-9]{2})-([A-Z0-9]{1,3})-([A-Z0-9]{1,2})
See the regex demo
Details:
^ - start of string
([A-Z0-9]{2}) - Group 1 capturing 2 uppercase ASCII letters or digits
- - a hyphen
([A-Z0-9]{1,3}) - Group 2 capturing 1 to 3 uppercase ASCII letters or digits
- - a hyphen
([A-Z0-9]{1,2}) - Group 3 capturing 1 to 2 uppercase ASCII letters or digits
You may adjust the values in the {min,max} quantifiers as required.
Sample Python demo:
import re
regex = r"^([A-Z0-9]{2})-([A-Z0-9]{1,3})-([A-Z0-9]{1,2})"
test_str = "EC-2A-07<EC-1D-10>\nEC-2-07\nT1-ZJF-4"
matches = re.findall(regex, test_str, re.MULTILINE)
print(matches)
# or line by line
lines = test_str.split('\n')
# No ^ is needed here: rx.match() only matches at the start of the string
rx = re.compile(r"([A-Z0-9]{2})-([A-Z0-9]{1,3})-([A-Z0-9]{1,2})")
for line in lines:
    m = rx.match(line)
    if m:
        print('{0} :: {1} :: {2}'.format(m.group(1), m.group(2), m.group(3)))
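For the roughly 2 million strings mentioned in the question, one possible way to package this is to compile the pattern once and wrap the match in a small helper. This is only a sketch: the names PART_RX and split_parts and the sample list are made up for illustration, and the helper returns None for any string that does not start with the expected three parts.

```python
import re

# Compile once and reuse; match() anchors at the start of the string,
# so anything after the third part (such as "<EC-1D-10>") is ignored.
PART_RX = re.compile(r"([A-Z0-9]{2})-([A-Z0-9]{1,3})-([A-Z0-9]{1,2})")


def split_parts(s):
    """Return the three leading parts as a tuple, or None if s does not match."""
    m = PART_RX.match(s)
    return m.groups() if m else None


# Hypothetical sample data: the three forms from the question plus a non-match.
samples = ["EC-2A-07<EC-1D-10>", "EC-2-07", "T1-ZJF-4", "bad string"]
results = [split_parts(s) for s in samples]
print(results)
```

Whether this is actually faster than .split('-') depends on your data, so it is worth timing both on a sample of the real strings.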
Upvotes: 1