Python regex with negative lookahead

Question

Hi with python I want to catch phone numbers in text but want to exclude the ones which are following the words Fax or fax.

I use the following regex which works if the sentence begin with Fax or fax but it doesn't work if fax is inside the sentence :

^(?!fax|Fax)(?:.*?)(?![-a-z])((?:[^0-9])((\+|00)33\s?|0|$0$)[123456789][ \.\-]?[0-9]{2}[ \.\-]?[0-9]{2}[ \.\-]?[0-9]{2}[ \.\-]?[0-9]{2})(?![0-9])

Here a example of text I analyse :

text
Adresse quai du Sa fax 06 32 32 32 33 rtel – 59100 ROUBAIX| FRANCE
faTel : 0 8 99 70 1761 – Fax : 06 32 32 32 34
Mail :support@domain.com
06 32 32 32 35

Fax 06 32 32 32 36
tel 06 32 32 32 37 henrg

the result of my regex is :

Match 1
Full match  5-42    `Adresse quai du Sa fax 06 32 32 32 33`
Group 1.    27-42   ` 06 32 32 32 33`
Group 2.    28-29   `0`
Match 2
Full match  72-117  `faTel : 0 8 99 70 1761 – Fax : 06 32 32 32 34`
Group 1.    102-117 ` 06 32 32 32 34`
Group 2.    103-104 `0`
Match 3
Full match  118-157 `Mail :support@domain.com
06 32 32 32 35`
Group 1.    142-157 `
06 32 32 32 35`
Group 2.    143-144 `0`
Match 4
Full match  178-196 `tel 06 32 32 32 37`
Group 1.    181-196 ` 06 32 32 32 37`
Group 2.    182-183 `0`

But I don't want "06 32 32 32 34" and "06 32 32 32 33" in the result because "fax" is before...

Thanks

Wiktor Stribiżew · Accepted Answer

I suggest using a regex that will match what you do not need but will match and capture what you need:

(?i)fax\W*\d[\s\d]*|(\d[\s\d]*\d)

See the regex demo. Green highlighted items are what you need to grab. Note: the numbers you will get in Group 1 should contain at least 2 digits. Also, you may precise the patterns as per further requirements, just use the same "framework", as I tried to simplify the regex structure to show the main concept.

Details

(?i) - case insensitive modifier
fax - the fax substring
\W* - any 0+ non-word chars (you may precise it to only work with spaces and colons, like \s*(?::\s*)?)
\d - a digit
[\s\d]* - 0+ whitespaces or digits
| - or...
(\d[\s\d]*\d) - Group 1 (the value you need)
- \d - a digit
- [\s\d]* - 0+ whitespaces or digits
- \d - a digit

In Python 2, use

import re
rx = r"(?i)fax\W*\d[\s\d]*|(\d[ \d]*\d)"
s ="text
Adresse quai du Sa fax 06 32 32 32 33 rtel – 59100 ROUBAIX| FRANCE
faTel : 0 8 99 70 1761 – Fax : 06 32 32 32 34
Mail :support@domain.com
06 32 32 32 35

Fax 06 32 32 32 36
tel 06 32 32 32 37 henrg"
res = filter(None, re.findall(rx, s))
print(res)
# => ['59100', '0 8 99 70 1761', '06 32 32 32 35', '06 32 32 32 37']

See the Python 2 demo

Python regex with negative lookahead

Answers (2)

Related Questions