henri_1310

Reputation: 315

Python regex with negative lookahead

Hi, with Python I want to catch phone numbers in text, but I want to exclude the ones that follow the word Fax or fax.

I use the following regex, which works if the sentence begins with Fax or fax, but it doesn't work if fax appears inside the sentence:

^(?!fax|Fax)(?:.*?)(?![-a-z])((?:[^0-9])((\+|00)33\s?|0|\(0\))[123456789][ \.\-]?[0-9]{2}[ \.\-]?[0-9]{2}[ \.\-]?[0-9]{2}[ \.\-]?[0-9]{2})(?![0-9])

Here is an example of the text I analyse:

text
Adresse quai du Sa fax 06 32 32 32 33 rtel – 59100 ROUBAIX| FRANCE
faTel : 0 8 99 70 1761 – Fax : 06 32 32 32 34
Mail :[email protected]
06 32 32 32 35

Fax 06 32 32 32 36
tel 06 32 32 32 37 henrg

The result of my regex is:

Match 1
Full match  5-42    `Adresse quai du Sa fax 06 32 32 32 33`
Group 1.    27-42   ` 06 32 32 32 33`
Group 2.    28-29   `0`
Match 2
Full match  72-117  `faTel : 0 8 99 70 1761 – Fax : 06 32 32 32 34`
Group 1.    102-117 ` 06 32 32 32 34`
Group 2.    103-104 `0`
Match 3
Full match  118-157 `Mail :[email protected]
06 32 32 32 35`
Group 1.    142-157 `
06 32 32 32 35`
Group 2.    143-144 `0`
Match 4
Full match  178-196 `tel 06 32 32 32 37`
Group 1.    181-196 ` 06 32 32 32 37`
Group 2.    182-183 `0`

But I don't want "06 32 32 32 34" and "06 32 32 32 33" in the result, because "fax" comes before them.

Thanks

Upvotes: 2

Views: 332

Answers (2)

ritchie46

Reputation: 14680

You're using a negative lookahead (?!...) where you need a negative lookbehind (?<!...).

With this regex I seem to get all the phone numbers and no fax numbers:

(?<!Fax |fax )((\d\d\s){5}|((\d\s){2}(\d\d\s){2}\d{4}))
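To illustrate the lookbehind idea in isolation, here is a minimal sketch. It uses a simplified pattern for numbers written as five space-separated digit pairs; that pattern, the function name and the sample string are assumptions made for the example, not the regex above. It also shows a limitation worth knowing: a lookbehind only inspects the characters immediately before the match, so a number written as "Fax : 06 ..." is not rejected.

```python
import re

# Two negative lookbehinds: reject a number when "Fax " or "fax " sits
# immediately before it. Python's re module requires lookbehinds to have a
# fixed width; each of these is exactly 4 characters.
# The number pattern (0 + digit, then four " dd" pairs) is a simplification
# assumed for this example only.
PATTERN = re.compile(r"(?<!Fax )(?<!fax )\b0\d(?: \d\d){4}\b")

def find_non_fax_numbers(text):
    """Return numbers not directly preceded by 'Fax ' or 'fax '."""
    return PATTERN.findall(text)

sample = "Fax 06 32 32 32 36\ntel 06 32 32 32 37\nFax : 06 32 32 32 34"
print(find_non_fax_numbers(sample))
# => ['06 32 32 32 37', '06 32 32 32 34']
```

The last number is still returned because the four characters before it are "x : ", not "Fax ". Covering variable separators with a lookbehind is awkward in Python's re, since the lookbehind cannot have a variable width.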

Upvotes: 0

Wiktor Stribiżew

Reputation: 626826

I suggest using a regex that matches, without capturing, what you do not need, and matches and captures what you do need:

(?i)fax\W*\d[\s\d]*|(\d[\s\d]*\d)

See the regex demo. The green highlighted items are what you need to grab. Note: the numbers you get in Group 1 always contain at least 2 digits. You can also refine the patterns as further requirements come up; just keep the same "framework", as I simplified the regex structure to show the main concept.

Details

  • (?i) - case insensitive modifier
  • fax - the fax substring
  • \W* - any 0+ non-word chars (you can restrict it to spaces and colons only, e.g. \s*(?::\s*)?)
  • \d - a digit
  • [\s\d]* - 0+ whitespaces or digits
  • | - or...
  • (\d[\s\d]*\d) - Group 1 (the value you need)
    • \d - a digit
    • [\s\d]* - 0+ whitespaces or digits
    • \d - a digit

In Python 2, use

import re
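# Group 1 uses [ \d] (a literal space) rather than [\s\d], so a captured number cannot span line breaks
rx = r"(?i)fax\W*\d[\s\d]*|(\d[ \d]*\d)"
s = "text\nAdresse quai du Sa fax 06 32 32 32 33 rtel – 59100 ROUBAIX| FRANCE\nfaTel : 0 8 99 70 1761 – Fax : 06 32 32 32 34\nMail :[email protected]\n06 32 32 32 35\n\nFax 06 32 32 32 36\ntel 06 32 32 32 37 henrg"
# re.findall returns '' for every match of the fax alternative; filter(None, ...) drops those.
# list(...) also makes this work on Python 3, where filter() returns an iterator.
res = list(filter(None, re.findall(rx, s)))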
print(res)
# => ['59100', '0 8 99 70 1761', '06 32 32 32 35', '06 32 32 32 37']

See the Python 2 demo
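As a sketch of the refinement mentioned in the details, the variant below replaces \W* with \s*(?::\s*)?, so only optional whitespace and a single colon may sit between "fax" and the number. The function name and the sample string are hypothetical, chosen for illustration; re.finditer is used so that each match can be checked for a Group 1 capture.

```python
import re

# Same "match what to skip | capture what to keep" structure as above, but
# the gap between "fax" and the number is limited to optional whitespace and
# at most one colon. [ \d] (a literal space, not \s) keeps a number from
# running across line breaks.
RX = re.compile(r"(?i)fax\s*(?::\s*)?\d[ \d]*|(\d[ \d]*\d)")

def extract_numbers(text):
    """Return Group 1 captures; matches that took the fax branch have none."""
    return [m.group(1) for m in RX.finditer(text) if m.group(1)]

# Hypothetical sample: the last line uses a separator ("-") that the
# stricter pattern no longer treats as belonging to a fax number.
sample = "Fax : 06 32 32 32 34\ntel 06 32 32 32 37\nfax-06 32 32 32 38"
print(extract_numbers(sample))
# => ['06 32 32 32 37', '06 32 32 32 38']
```

The trade-off is visible in the last line: a stricter separator avoids swallowing unrelated text after "fax", but a number written as "fax-06 ..." is then returned as an ordinary phone number.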

Upvotes: 2
