Reputation: 33
I have a large number of strings on the format YYYYYYYYXXXXXXXXZZZZZZZZ, where X, Y, and Z are numbers of fix length, eight digits. Now, the problem is that I need to parse out the middle sequence of integers and remove any leading zeroes. Unfortunately is the only way to determine where each of the three sequences begins/ends is to count the number of digits.
I am currently doing it in two steps, i.e:
m = re.match(
r"(?P<first_sequence>\d{8})"
r"(?P<second_sequence>\d{8})"
r"(?P<third_sequence>\d{8})",
string)
second_secquence = m.group(2)
second_secquence.lstrip(0)
Which does work, and gives me the right results, e.g.:
112233441234567855667788 --> 12345678
112233440012345655667788 --> 123456
112233001234567855667788 --> 12345678
112233000012345655667788 --> 123456
But is there a better method? Is is possible to write a single regex expression which matches against the second sequence, sans the leading zeros?
I guess I am looking for a regex which does the following:
The above solution does work, as mentioned, so the purpose of this problem is more to improve my knowledge of regex. I appreciate any pointers.
Upvotes: 3
Views: 2463
Reputation: 17637
Just to show that it is possible with regex:
https://regex101.com/r/8RSxaH/2
# CODE AUTO GENERATED BY REGEX101.COM (SEE LINK ABOVE)
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(?<=\d{8})((?:0*)(\d{,8}))(?=\d{8})"
test_str = ("112233441234567855667788\n"
"112233440012345655667788\n"
"112233001234567855667788\n"
"112233000012345655667788")
matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches):
matchNum = matchNum + 1
print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
for groupNum in range(0, len(match.groups())):
groupNum = groupNum + 1
print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
Although you don't really need it to do what you're asking
Upvotes: 1
Reputation: 321
Agree with the other answers here that regex isn't really required. If you really want to use regex, then \d{8}0*(\d*)\d{8}
should do it.
Upvotes: 2
Reputation: 338208
This is a typical case of "useless use of regular expressions".
Your strings are fixed-length. Just cut them at the appropriate positions.
s = "112233440012345655667788"
int(s[8:16])
# -> 123456
Upvotes: 4
Reputation: 5948
I think it's simpler not to use regex.
result = my_str[8:16].lstrip('0')
Upvotes: 3