Reputation: 12757

Python re returning non-matching lines

I'm trying to solve a regular expression puzzle and I'm ... puzzled. I expect the following:

import re
import fileinput

TEST_DATA = [
    "6",
    "2 ",
    "1 877 2638277 ",
    "91-011-23413627"
]

for line in TEST_DATA:
    print(
        re.sub(
            r'(\d{1,3})[- ](\d{2,3})[- ]+(\d{5,10})',
            r'CountryCode=\1,LocalAreaCode=\2,Number=\3',
            line))

to give me this:

CountryCode=1,LocalAreaCode=877,Number=2638277 
CountryCode=91,LocalAreaCode=011,Number=23413627

Instead, I get this:

6
2 
CountryCode=1,LocalAreaCode=877,Number=2638277 
CountryCode=91,LocalAreaCode=011,Number=23413627

I don't understand why the lines that don't match are being printed.

Upvotes: 1

Answers (3)

Adam Smith

Reputation: 54243

I gotta tell ya, I really HATE re.sub. I don't know why, I don't have a great explanation, but I avoid it like the plague. I can't even really remember ever using it to poor effect, I just don't like it....

The reason it's not producing your expected output is that re.sub will return the string regardless of whether it matches the regex. It's kind of like "Hello there".replace("foo","bar") -- just because it doesn't find anything to replace doesn't mean it throws away your string. What I would do instead is this:

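import re

# Named groups let the output template refer to each part by name.
pattern = r'(?P<country>\d{1,3})[- ](?P<area>\d{2,3})[- ]+(?P<number>\d{5,10})'
text = "CountryCode={country},LocalAreaCode={area},Number={number}"

for line in TEST_DATA:  # TEST_DATA as defined in the question
    match = re.match(pattern, line)
    if not match:
        continue  # skip non-matching lines instead of echoing them
    print(text.format(**match.groupdict()))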

Upvotes: 2

markcial

Reputation: 9333

Try checking for a match before substituting:

import re

TEST_DATA = [
    "6",
    "2 ",
    "1 877 2638277 ",
    "91-011-23413627"
]

pattern = r'(\d{1,3})[- ](\d{2,3})[- ]+(\d{5,10})'
rep = r'CountryCode=\1,LocalAreaCode=\2,Number=\3'

for line in TEST_DATA:
    if re.match(pattern, line):
        print(re.sub(pattern, rep, line))
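
The loop above runs the regex twice on every matching line: once in re.match and once in re.sub. As a rough sketch of one way around that, the match object's expand() method can fill in the same backreference template directly. The names PATTERN, TEMPLATE and formatted are just illustrative. Note one behavioural difference: expand() returns only the expanded template, so any text around the match (such as the trailing space in "1 877 2638277 ") is dropped rather than carried over as re.sub would do.

```python
import re

TEST_DATA = [
    "6",
    "2 ",
    "1 877 2638277 ",
    "91-011-23413627",
]

# Same pattern and replacement template as in the answer above.
PATTERN = re.compile(r'(\d{1,3})[- ](\d{2,3})[- ]+(\d{5,10})')
TEMPLATE = r'CountryCode=\1,LocalAreaCode=\2,Number=\3'

formatted = []
for line in TEST_DATA:
    match = PATTERN.match(line)  # anchored at the start of the line
    if match:
        # expand() substitutes \1, \2, \3 from this match only;
        # text outside the match is not included in the result.
        formatted.append(match.expand(TEMPLATE))

for item in formatted:
    print(item)
```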

Upvotes: 0

Kevin

Reputation: 76254

re.sub returns the string regardless of whether a replacement occurred. From the documentation:

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.
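
A quick way to see this for yourself, using the pattern from the question (the variable names here are only for illustration):

```python
import re

PATTERN = r'(\d{1,3})[- ](\d{2,3})[- ]+(\d{5,10})'
REPLACEMENT = r'CountryCode=\1,LocalAreaCode=\2,Number=\3'

# "6" contains no match, so re.sub hands the input back as-is.
no_match = re.sub(PATTERN, REPLACEMENT, "6")

# This line does match, so the matched part is rewritten.
matched = re.sub(PATTERN, REPLACEMENT, "91-011-23413627")

# str.replace behaves the same way when there is nothing to replace.
plain = "Hello there".replace("foo", "bar")

print(repr(no_match))
print(repr(matched))
print(repr(plain))
```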

Perhaps you could first check to see if a match occurred, and then perform the replacement.

my_pattern = r'(\d{1,3})[- ](\d{2,3})[- ]+(\d{5,10})'

for line in TEST_DATA:
    # Only print lines where the pattern matches at the start of the line.
    if re.match(my_pattern, line):
        print(
            re.sub(
                my_pattern,
                r'CountryCode=\1,LocalAreaCode=\2,Number=\3',
                line))
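
If you would rather not run the regex twice, here is a minimal sketch using re.subn, which returns both the new string and the number of substitutions made. The helper name rewrite_matching is made up for this example. Keep in mind that subn, like sub, searches anywhere in the line, whereas re.match only matches at the start, so the two approaches can differ on lines where the number appears after other text.

```python
import re

TEST_DATA = [
    "6",
    "2 ",
    "1 877 2638277 ",
    "91-011-23413627",
]

PATTERN = re.compile(r'(\d{1,3})[- ](\d{2,3})[- ]+(\d{5,10})')
REPLACEMENT = r'CountryCode=\1,LocalAreaCode=\2,Number=\3'


def rewrite_matching(lines):
    """Yield rewritten lines, skipping any line where nothing was replaced."""
    for line in lines:
        # subn returns a (new_string, substitution_count) tuple.
        new_line, count = PATTERN.subn(REPLACEMENT, line)
        if count:
            yield new_line


results = list(rewrite_matching(TEST_DATA))
for result in results:
    print(result)
```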

Upvotes: 6
