Python regex negative lookbehind not failing match

Question

I'm writing a regex to match phone numbers. One of the problems I've encountered is that some postcodes look like phone numbers. For example, in Brazil, postcodes look like this:

30.160-0131

So a simple regex will capture them as false positives:

In [63]: re.search(r"(\d+\.\d+-\d+)", "30.160-0131")
Out[63]: <_sre.SRE_Match at 0x102150990>

Luckily, such postcodes often come with a prefix which generally means "postcode", like this:

CEP 30.160-0131

So, if you see CEP in front of something that looks like a phone number, then it's not a phone number - it's a postcode. I've been trying to write a regex to capture that using negative lookbehind, but it's not working. It still matches:

In [62]: re.search(r"(?

Why does it still match, and how can I get the negative look-behind to fail the match?

Jerry · Accepted Answer

You can avoid negative lookbehinds altogether if you let the regex match those postcodes as well, and capture only the phone numbers:

m = re.search(r"CEP \d+\.\d+-\d+|(\d+\.\d+-\d+)", s)

And then check if you got something in m.group(1) for the phone numbers.
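A minimal sketch of that check, assuming the same pattern as above. The helper name `extract_phone` is hypothetical and only for illustration:

```python
import re

# First branch consumes "CEP <number>" without capturing anything;
# second branch captures any other phone-like number in group 1.
PATTERN = re.compile(r"CEP \d+\.\d+-\d+|(\d+\.\d+-\d+)")

def extract_phone(s):
    """Return the phone-like number from the first match, or None.

    Hypothetical helper: returns None both when nothing matches and
    when the first match is a CEP-prefixed postcode (group 1 is unset).
    """
    m = PATTERN.search(s)
    if m is None:
        return None
    return m.group(1)

print(extract_phone("CEP 30.160-0131"))  # None
print(extract_phone("30.160-0131"))      # 30.160-0131
```

Note that re.search stops at the first match, so if a postcode comes before a phone number in the same string, group 1 will be empty; re.findall or re.finditer covers that case.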

Little demo with re.findall:

>>> import re
>>> s = "There is a CEP 30.160-0131 and a  30.160-0132 in that sentence, which repeats itself like there is a CEP 30.160-0131 and a  30.160-0132 in that sentence."
>>> m = re.findall(r"CEP \d+\.\d+-\d+|(\d+\.\d+-\d+)", s)
>>> print(m)
['', '30.160-0132', '', '30.160-0132']

And from there, you can filter out the empty strings.
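One way to do that filtering, as a small sketch using a shortened version of the demo sentence (the variable names are only illustrative):

```python
import re

s = "There is a CEP 30.160-0131 and a  30.160-0132 in that sentence."

# findall returns the contents of group 1 for every match; matches of the
# "CEP ..." branch leave group 1 unset, which shows up as an empty string.
matches = re.findall(r"CEP \d+\.\d+-\d+|(\d+\.\d+-\d+)", s)

# Keep only the non-empty captures, i.e. the numbers without a CEP prefix.
phones = [m for m in matches if m]
print(phones)  # ['30.160-0132']
```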
