Reputation: 12737
I'm trying to solve a regular expression puzzle and I'm ... puzzled. I expect the following:
import re
import fileinput
TEST_DATA = [
    "6",
    "2 ",
    "1 877 2638277 ",
    "91-011-23413627"
]
for line in TEST_DATA:
    print(
        re.sub(
            r'(\d{1,3})[- ](\d{2,3})[- ]+(\d{5,10})',
            r'CountryCode=\1,LocalAreaCode=\2,Number=\3',
            line))
to give me this:
CountryCode=1,LocalAreaCode=877,Number=2638277
CountryCode=91,LocalAreaCode=011,Number=23413627
Instead, I get this:
6
2
CountryCode=1,LocalAreaCode=877,Number=2638277
CountryCode=91,LocalAreaCode=011,Number=23413627
I don't understand why the lines that don't match are being printed.
Upvotes: 1
Views: 122
Reputation: 54163
I gotta tell ya, I really HATE re.sub. I don't know why, I don't have a great explanation, but I avoid it like the plague. I can't even really remember ever using it to poor effect, I just don't like it....
The reason it's not producing your expected output is that re.sub will return the string regardless of whether it matches the regex. It's kind of like "Hello there".replace("foo","bar") -- just because it doesn't find anything to replace doesn't mean it throws away your string. What I would do instead is this:
pattern = r'(?P<country>\d{1,3})[- ](?P<area>\d{2,3})[- ]+(?P<number>\d{5,10})'
text = "CountryCode={country},LocalAreaCode={area},Number={number}"
for line in TEST_DATA:
    match = re.match(pattern, line)
    if not match:
        continue  # skip lines that don't match instead of printing them
    print(text.format(**match.groupdict()))
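One caveat with the loop above: re.match only tries the pattern at the start of the string, while re.sub scans the whole string. If a line could carry leading text, re.search is the closer equivalent. Here is a minimal sketch of that; the helper name describe and the "Tel: " input are made up for illustration and are not part of the question's data.

```python
import re

pattern = r'(?P<country>\d{1,3})[- ](?P<area>\d{2,3})[- ]+(?P<number>\d{5,10})'
text = "CountryCode={country},LocalAreaCode={area},Number={number}"


def describe(line):
    """Return the formatted phone number, or None if the line has no match.

    Hypothetical helper: uses re.search so the number may appear anywhere
    in the line, not only at the very start.
    """
    match = re.search(pattern, line)
    if match is None:
        return None
    return text.format(**match.groupdict())


prefixed = "Tel: 91-011-23413627"  # hypothetical input with leading text

print(re.match(pattern, prefixed))  # re.match fails: the line starts with "Tel"
print(describe(prefixed))           # re.search still finds the number
print(describe("6"))                # no match at all
```

For the four lines in TEST_DATA the two functions behave the same, so this only matters if the real input is messier than the sample.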
Upvotes: 2
Reputation: 9323
Try with:
import re
TEST_DATA = [
    "6",
    "2 ",
    "1 877 2638277 ",
    "91-011-23413627"
]
pattern = r'(\d{1,3})[- ](\d{2,3})[- ]+(\d{5,10})'
rep = r'CountryCode=\1,LocalAreaCode=\2,Number=\3'
for line in TEST_DATA:
    # only substitute (and print) when the line actually matches
    if re.match(pattern, line):
        print(re.sub(pattern, rep, line))
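The check-then-substitute loop runs the regex twice per matching line. A possible variation, sketched here under the assumption that only the matched part of the line is wanted, is to keep the match object and call its expand method with the same backreference template. The function name convert_line is hypothetical.

```python
import re

pattern = re.compile(r'(\d{1,3})[- ](\d{2,3})[- ]+(\d{5,10})')
rep = r'CountryCode=\1,LocalAreaCode=\2,Number=\3'


def convert_line(line):
    """Return the expanded template for a matching line, else None.

    Match.expand fills \\1, \\2, \\3 from the captured groups and returns
    only the expanded template, so text outside the match is dropped.
    """
    match = pattern.match(line)
    if match is None:
        return None
    return match.expand(rep)


for line in ["6", "2 ", "1 877 2638277 ", "91-011-23413627"]:
    converted = convert_line(line)
    if converted is not None:
        print(converted)
```

Note the difference from re.sub: for "1 877 2638277 " re.sub keeps the trailing space that lies outside the match, while expand returns just the formatted text.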
Upvotes: 0
Reputation: 76194
re.sub returns the string regardless of whether a replacement occurred. From the documentation:
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.
Perhaps you could first check to see if a match occurred, and then perform the replacement.
my_pattern = r'(\d{1,3})[- ](\d{2,3})[- ]+(\d{5,10})'
for line in TEST_DATA:
    if re.match(my_pattern, line):
        print(
            re.sub(
                my_pattern,
                r'CountryCode=\1,LocalAreaCode=\2,Number=\3',
                line))
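Another way to do the check and the replacement in one pass is re.subn, which returns both the new string and the number of substitutions made. The sketch below assumes the same TEST_DATA as the question; the names template and convert are made up for illustration.

```python
import re

TEST_DATA = ["6", "2 ", "1 877 2638277 ", "91-011-23413627"]

my_pattern = re.compile(r'(\d{1,3})[- ](\d{2,3})[- ]+(\d{5,10})')
template = r'CountryCode=\1,LocalAreaCode=\2,Number=\3'


def convert(lines):
    """Return the substituted lines, skipping lines where nothing was replaced."""
    results = []
    for line in lines:
        # subn returns a (new_string, number_of_substitutions) tuple
        new_line, count = my_pattern.subn(template, line)
        if count:
            results.append(new_line)
    return results


for converted in convert(TEST_DATA):
    print(converted)
```

As with re.sub, any text outside the match is kept, so the trailing space in "1 877 2638277 " survives in the output; call .strip() on the result if that is unwanted.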
Upvotes: 6