Reputation: 21906
I'm writing a regex to match phone numbers. One of the problems I've encountered is that some postcodes look like phone numbers. For example, in Brazil, postcodes look like this:
30.160-0131
So a simple regex will capture them as false positives:
In [63]: re.search(r"(?P<phone>\d+\.\d+-\d+)", "30.160-0131")
Out[63]: <_sre.SRE_Match at 0x102150990>
Luckily, such postcodes often come with a prefix which generally means "postcode", like this:
CEP 30.160-0131
So, if you see CEP in front of something that looks like a phone number, then it's not a phone number - it's a postcode. I've been trying to write a regex to capture that using negative lookbehind, but it's not working. It still matches:
In [62]: re.search(r"(?<!CEP )(\d+\.\d+-\d+)", "CEP 30.160-0131")
Out[62]: <_sre.SRE_Match at 0x102150eb8>
Why does it still match, and how can I get the negative look-behind to fail the match?
Upvotes: 3
Views: 193
Reputation: 71538
You can avoid negative lookbehinds altogether if you let the regex match those postcodes as well, and capture only the phone numbers:
m = re.search(r"CEP \d+\.\d+-\d+|(\d+\.\d+-\d+)", s)
And then check if you got something in m.group(1) for the phone numbers.
Little demo with re.findall:
>>> import re
>>> s = "There is a CEP 30.160-0131 and a 30.160-0132 in that sentence, which repeats itself like there is a CEP 30.160-0131 and a 30.160-0132 in that sentence."
>>> m = re.findall(r"CEP \d+\.\d+-\d+|(\d+\.\d+-\d+)", s)
>>> print(m)
['', '30.160-0132', '', '30.160-0132']
And from there, you can filter out the empty strings.
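As a minimal sketch of that filtering step, the same pattern can be used with re.finditer so that only matches with a non-empty group 1 are kept. The helper name extract_phones and the shortened sample sentence are illustrative, not part of the original answer.

```python
import re

# Same alternation as above: the first branch consumes "CEP <number>" without
# capturing anything, the second branch captures a bare number in group 1.
PATTERN = re.compile(r"CEP \d+\.\d+-\d+|(\d+\.\d+-\d+)")

def extract_phones(text):
    """Return only the numbers that were not consumed by the CEP branch.

    Hypothetical helper for illustration.
    """
    # group(1) is None when the CEP branch matched, so it is filtered out.
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1)]

s = "There is a CEP 30.160-0131 and a 30.160-0132 in that sentence."
print(extract_phones(s))  # ['30.160-0132']
```

Using finditer instead of findall avoids building the list of empty strings in the first place.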
Upvotes: 1
Reputation: 97938
The expression matches because nothing anchors the start of the number. For example, in "CEP 11.213-132" the lookbehind rejects a match starting at the first 1, but the engine then retries one character later and matches 1.213-132, since that substring does not immediately follow "CEP ". You can force whitespace, or the start of the string, to come right before the first digit:
re.search(r"(?<!CEP)(?:\s+|^)(\d+\.\d+-\d+)", s)
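The behaviour can be checked with a short script. The first two patterns are the ones from the question and from this answer. The third, which adds a (?<!\d) lookbehind so a match cannot start in the middle of a digit run, is an assumed alternative and not something the answer proposes.

```python
import re

text = "CEP 30.160-0131"

# Pattern from the question: the lookbehind only blocks a match starting at
# "3". One character later the preceding text is "EP 3", so it succeeds.
original = re.search(r"(?<!CEP )(\d+\.\d+-\d+)", text)
print(original.group(1))  # 0.160-0131

# Pattern from this answer: the number must follow whitespace or the start
# of the string, and that whitespace must not follow "CEP".
anchored = re.search(r"(?<!CEP)(?:\s+|^)(\d+\.\d+-\d+)", text)
print(anchored)  # None

# Assumed alternative: keep the original lookbehind and additionally forbid
# a digit right before the match, so the engine cannot skip the first digit.
strict = re.compile(r"(?<!CEP )(?<!\d)(\d+\.\d+-\d+)")
print(strict.search(text))  # None
print(strict.search("call 30.160-0132").group(1))  # 30.160-0132
```

Note that the anchored pattern assumes exactly one space after CEP. With two or more spaces the lookbehind no longer sees "CEP" directly before the last whitespace character, so the postcode would match again.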
Upvotes: 3