Reputation: 33

Remove leading zeros in middle of string with regex

I have a large number of strings in the format YYYYYYYYXXXXXXXXZZZZZZZZ, where X, Y, and Z are numbers of fixed length, eight digits each. The problem is that I need to parse out the middle sequence of integers and remove any leading zeros. Unfortunately, the only way to determine where each of the three sequences begins and ends is to count the number of digits.

I am currently doing it in two steps, i.e:

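import re

m = re.match(
    r"(?P<first_sequence>\d{8})"
    r"(?P<second_sequence>\d{8})"
    r"(?P<third_sequence>\d{8})",
    string)
# Take the middle eight digits, then strip their leading zeros.
second_sequence = m.group("second_sequence")
second_sequence = second_sequence.lstrip("0")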

This works and gives me the right results, e.g.:

112233441234567855667788 --> 12345678
112233440012345655667788 --> 123456
112233001234567855667788 --> 12345678
112233000012345655667788 --> 123456

But is there a better method? Is it possible to write a single regular expression that matches the second sequence, sans the leading zeros?

I guess I am looking for a regex which does the following:

Skips over the first eight digits.
Skips any leading zeros.
Captures anything after that, up to the point where there are sixteen characters behind and eight in front.

The above solution does work, as mentioned, so the purpose of this question is more to improve my knowledge of regex. I appreciate any pointers.

Upvotes: 3

Answers (4)

SierraOscar

Reputation: 17637

Just to show that it is possible with regex:

https://regex101.com/r/8RSxaH/2

# CODE AUTO GENERATED BY REGEX101.COM (SEE LINK ABOVE)
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"(?<=\d{8})((?:0*)(\d{,8}))(?=\d{8})"

test_str = ("112233441234567855667788\n"
    "112233440012345655667788\n"
    "112233001234567855667788\n"
    "112233000012345655667788")

matches = re.finditer(regex, test_str)

for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1

    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))

    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1

        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

Although you don't really need it to do what you're asking.

Upvotes: 1

JPEG_

Reputation: 321

Agree with the other answers here that regex isn't really required. If you really want to use regex, then \d{8}0*(\d*)\d{8} should do it.
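
A minimal sketch of how that pattern might be applied, assuming every input is exactly 24 digits. Using re.fullmatch is a choice made here so that the final \d{8} is pinned to the end of the string; the names PATTERN and SAMPLES are only for illustration.

```python
import re

# Pattern from above: eight digits, optional zeros, captured remainder, eight digits.
PATTERN = re.compile(r"\d{8}0*(\d*)\d{8}")

# Sample inputs taken from the question.
SAMPLES = [
    "112233441234567855667788",
    "112233440012345655667788",
    "112233001234567855667788",
    "112233000012345655667788",
]

# fullmatch forces the trailing \d{8} to consume the last eight digits,
# so the greedy (\d*) backtracks to exactly the rest of the middle field.
results = [PATTERN.fullmatch(s).group(1) for s in SAMPLES]
print(results)
# -> ['12345678', '123456', '12345678', '123456']
```

Note that a middle field consisting only of zeros would come back as an empty string with this pattern.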

Upvotes: 2

Tomalak

Reputation: 338208

This is a typical case of "useless use of regular expressions".

Your strings are fixed-length. Just cut them at the appropriate positions.

s = "112233440012345655667788"
int(s[8:16])
# -> 123456
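
One thing worth knowing before choosing between int() and the lstrip approach from the question: int() returns an integer, while lstrip returns a string, and the two differ when the middle field is all zeros. A small sketch (the all-zero input is made up for illustration):

```python
s = "112233440012345655667788"

as_int = int(s[8:16])          # integer 123456
as_str = s[8:16].lstrip("0")   # string "123456"

# Hypothetical input whose middle field is all zeros.
zeros = "112233440000000055667788"

zero_int = int(zeros[8:16])          # integer 0
zero_str = zeros[8:16].lstrip("0")   # empty string ""

print(as_int, repr(as_str), zero_int, repr(zero_str))
# -> 123456 '123456' 0 ''
```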

Upvotes: 4

lucasnadalutti

Reputation: 5948

I think it's simpler not to use regex.

result = my_str[8:16].lstrip('0')

Upvotes: 3
