Amanda

Reputation: 12737

Python re returning non-matching lines

I'm trying to solve a regular expression puzzle and I'm ... puzzled. I expect the following:

import re

TEST_DATA = [
    "6",
    "2 ",
    "1 877 2638277 ",
    "91-011-23413627"
]

for line in TEST_DATA:
    print(
        re.sub(
            r'(\d{1,3})[- ](\d{2,3})[- ]+(\d{5,10})',
            r'CountryCode=\1,LocalAreaCode=\2,Number=\3',
            line))

to give me this:

CountryCode=1,LocalAreaCode=877,Number=2638277 
CountryCode=91,LocalAreaCode=011,Number=23413627

Instead I get this:

6
2 
CountryCode=1,LocalAreaCode=877,Number=2638277 
CountryCode=91,LocalAreaCode=011,Number=23413627

I don't understand why the lines that don't match are being printed.

Upvotes: 1

Views: 122

Answers (3)

Adam Smith

Reputation: 54163

I gotta tell ya, I really HATE re.sub. I don't know why, I don't have a great explanation, but I avoid it like the plague. I can't even really remember ever using it to poor effect, I just don't like it....

The reason it's not producing your expected output is that re.sub will return the string regardless of whether it matches the regex. It's kind of like "Hello there".replace("foo","bar") -- just because it doesn't find anything to replace doesn't mean it throws away your string. What I would do instead is this:

# Named groups let the output template refer to each part by name.
pattern = r'(?P<country>\d{1,3})[- ](?P<area>\d{2,3})[- ]+(?P<number>\d{5,10})'
text = "CountryCode={country},LocalAreaCode={area},Number={number}"

for line in TEST_DATA:
    match = re.match(pattern, line)
    if match:  # skip lines that don't match at all
        print(text.format(**match.groupdict()))
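
To see the behaviour described above in isolation, here is a minimal sketch that runs re.sub on one non-matching and one matching input from the question, next to the str.replace analogy. The variable names (no_match, replaced, matched) are only for illustration.

```python
import re

pattern = r'(\d{1,3})[- ](\d{2,3})[- ]+(\d{5,10})'
rep = r'CountryCode=\1,LocalAreaCode=\2,Number=\3'

# No match: re.sub hands back the original string untouched,
# just as str.replace does when the search text is absent.
no_match = re.sub(pattern, rep, "6")
replaced = "Hello there".replace("foo", "bar")

# Match: only the matched portion of the string is rewritten.
matched = re.sub(pattern, rep, "91-011-23413627")

print(repr(no_match))
print(repr(replaced))
print(repr(matched))
```

Neither call signals "nothing was replaced", so the caller has to filter non-matching lines itself.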

Upvotes: 2

markcial

Reputation: 9323

Try checking for a match before substituting:

import re

TEST_DATA = [
    "6",
    "2 ",
    "1 877 2638277 ",
    "91-011-23413627"
]

pattern = r'(\d{1,3})[- ](\d{2,3})[- ]+(\d{5,10})'
rep = r'CountryCode=\1,LocalAreaCode=\2,Number=\3'

for line in TEST_DATA:
    if re.match(pattern, line):
        print(re.sub(pattern, rep, line))

Upvotes: 0

Kevin

Reputation: 76194

re.sub returns the string regardless of whether a replacement occurred. From the documentation:

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.

Perhaps you could first check to see if a match occurred, and then perform the replacement.

my_pattern = r'(\d{1,3})[- ](\d{2,3})[- ]+(\d{5,10})'

for line in TEST_DATA:
    if re.match(my_pattern, line):
        print(
            re.sub(
                my_pattern,
                r'CountryCode=\1,LocalAreaCode=\2,Number=\3',
                line))
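
One possible variation, offered as a sketch and not the only way to do it: re.subn returns both the new string and the number of replacements made, so the check and the substitution happen in a single pass. It also avoids a subtle mismatch, since re.match only looks at the start of the string while re.sub replaces a match anywhere in it. The helper name format_numbers is hypothetical.

```python
import re

TEST_DATA = ["6", "2 ", "1 877 2638277 ", "91-011-23413627"]

pattern = re.compile(r'(\d{1,3})[- ](\d{2,3})[- ]+(\d{5,10})')
rep = r'CountryCode=\1,LocalAreaCode=\2,Number=\3'


def format_numbers(lines):
    """Return the rewritten lines, dropping any line where nothing was replaced."""
    results = []
    for line in lines:
        # subn returns a (new_string, number_of_replacements) tuple.
        new_line, count = pattern.subn(rep, line)
        if count:
            results.append(new_line)
    return results


for out in format_numbers(TEST_DATA):
    print(out)
```

With the question's test data this prints only the two rewritten lines; the trailing space on the third input is kept because it lies outside the matched text.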

Upvotes: 6
