Shekhar Samanta

Reputation: 905

Python regular expression for phone numbers

I am very new to regular expressions and am looking for help parsing phone numbers out of HTML text.

On the source site, the HTML tags are badly malformed and there are no unique selectors I can use. Below is the list of formats I need to parse.

raw = """+49 39291 55-217
02102 7007064
0152 01680970
+49 39291 55-216
02102 3802 22
0800 333004 451-100
+49 221 9937 26950
02151-47974510
+49(0)6105 937 -539
0211/409 2268
+49(0)6105 937 -539
+49211/584-623
0211 58422 2012
+49 (9131) 7-35335
+49 521 9488 2470
+ 49-40-70 70 84 - 0
0211 17 95 99 04
02151-47974327
+49 203 28900 1121
0211 9449-2555
+49 (5 41) 9 98 -2268"""

I tried this pattern but could not get any further with it:

import re, requests

phones = re.findall(re.compile(r'.*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?'),raw)

phones
['102 7007064', '152 0168097', '151-4797451', '937 -539\n0211', '937 -539\n+4921', '584-623\n0211', '151-4797432']

Any advice or help is highly appreciated. Thank you.

Upvotes: 1

Views: 3669

Answers (1)

Wiktor Stribiżew

Reputation: 627128

I suggest using this pattern:

(?:\B\+ ?49|\b0)(?: *[(-]? *\d(?:[ \d]*\d)?)? *(?:[)-] *)?\d+ *(?:[/)-] *)?\d+ *(?:[/)-] *)?\d+(?: *- *\d+)?

See the regex demo. Note that it is written based on your comment saying the phone numbers start with +49 or a 0, and on the list of examples you provided. It may be considered a work in progress, since you have not provided more specific rules for phone number extraction.

Pattern details

  • (?:\B\+ ?49|\b0) - either a +, an optional space, and 49, or a 0; neither alternative may be preceded by a word char
  • (?: *[(-]? *\d(?:[ \d]*\d)?)? - an optional substring: 0+ spaces, an optional ( or -, 0+ spaces, a digit, and then an optional run of digits/spaces that ends with a digit
  •  *(?:[)-] *)? - 0+ spaces, then an optional ) or - followed by 0+ spaces
  • \d+ - 1+ digits
  •  * - 0+ spaces (a literal space followed by *)
  • (?:[/)-] *)? - an optional /, ) or - followed by 0+ spaces
  • \d+ - 1+ digits
  •  *(?:[/)-] *)? - 0+ spaces, then an optional /, ) or - followed by 0+ spaces
  • \d+ - 1+ digits
  • (?: *- *\d+)? - an optional sequence: 0+ spaces, -, 0+ spaces, 1+ digits.
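
As a minimal sketch of how the pattern above might be used from Python: the names PHONE_RE, sample and extract_phones are hypothetical and chosen only for illustration, and sample is a three-line subset of the list in the question. Because every group in the pattern is non-capturing, re.findall returns the full matched substrings.

```python
import re

# Pattern copied verbatim from the answer; all groups are non-capturing,
# so findall() yields whole matches rather than group contents.
PHONE_RE = re.compile(
    r"(?:\B\+ ?49|\b0)(?: *[(-]? *\d(?:[ \d]*\d)?)? *(?:[)-] *)?\d+ *(?:[/)-] *)?\d+ *(?:[/)-] *)?\d+(?: *- *\d+)?"
)

# Hypothetical sample: three of the formats listed in the question.
sample = """+49 39291 55-217
02102 7007064
+ 49-40-70 70 84 - 0"""


def extract_phones(text):
    """Return every substring of `text` matched by PHONE_RE."""
    return PHONE_RE.findall(text)


if __name__ == "__main__":
    for phone in extract_phones(sample):
        print(phone)
```

The pattern only allows literal spaces between digit groups, never \s or \D, so a match cannot run across a line break the way the matches from the original attempt did.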

Upvotes: 5
