Muhammad Adeel Zahid

Reputation: 17794

Parse International Phone numbers from web pages

I am using HtmlAgilityPack to parse web pages. Once the document is loaded, I want to extract possible phone numbers from the HTML. Currently, I use a regex for this purpose. The following code checks the page for phone number matches:

    private static string phoneReg =
        @"[\+]{0,1}(\d{10,13}|[\(][\+]{0,1}\d{2,}[\13)]*\d{5,13}|\d{2,6}[\-]{1}\d{2,13}[\-]*\d{3,13})";
    private static Regex phoneRegex = new Regex(phoneReg, RegexOptions.IgnoreCase);

    var phoneMatches = phoneRegex.Matches(doci.DocumentNode.InnerText);

where doci is the HtmlDocument from HtmlAgilityPack. The problem is that it fails to match some phone numbers, such as 08450 211 211 and +44 (0) 1246 733 000.

Is there a general-purpose regular expression, suitable for crawling websites, that matches most forms of international phone numbers?

Upvotes: 1

Views: 1605

Answers (1)

Oscar Mederos

Reputation: 29863

Your regex cannot match those phone numbers (08450 211 211 and +44 (0) 1246 733 000) because none of its alternatives allows whitespace between the digit groups.

The first thing you have to do when writing a regular expression is to identify the pattern you want to match.

So, my suggestion is to write down a list of the different phone number formats you need and update your question; then we will be able to help you. Otherwise, I can always come up with a new phone number that your regex does not match, or the regex will match more than you want.

Here is a regex that will match the above phone numbers:

    (?:\+\d+\s+\(\d+\)\s+)?\d{4,5}\s+\d{3}\s+\d{3}
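To check the pattern above against the two numbers from the question, a minimal sketch could look like the following. The class name PhoneRegexDemo, the helper FindAll and the sample sentence are made up for illustration; in the real crawler the input would be doci.DocumentNode.InnerText.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PhoneRegexDemo
{
    // Pattern from the answer: an optional "+CC (N) " prefix followed by
    // three whitespace-separated digit groups of 4-5, 3 and 3 digits.
    private static readonly Regex Pattern = new Regex(
        @"(?:\+\d+\s+\(\d+\)\s+)?\d{4,5}\s+\d{3}\s+\d{3}");

    // Hypothetical helper: returns every match found in the given text.
    public static List<string> FindAll(string text)
    {
        var results = new List<string>();
        foreach (Match m in Pattern.Matches(text))
        {
            results.Add(m.Value);
        }
        return results;
    }

    public static void Main()
    {
        // Illustrative sample text, not taken from a real page.
        string text = "Call 08450 211 211 or +44 (0) 1246 733 000 today.";
        foreach (string phone in FindAll(text))
        {
            Console.WriteLine(phone);
        }
        // Prints:
        // 08450 211 211
        // +44 (0) 1246 733 000
    }
}
```

Note that this pattern only covers these two layouts; a number written with hyphens or with different group sizes would still be missed.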

Edit:

Based on your comment, I would just use this looser regex and then discard the matches that are not phone numbers:

    (?:\+\d+\s+\(\d+\)\s+)?[\d -]+
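One possible way to do the "match broadly, then discard" step is sketched below. The filter rule is an assumption of this sketch, not something the regex itself implies: a candidate is kept only if it contains between 7 and 15 digits (15 is the E.164 maximum; 7 is an arbitrary lower bound). The names PhoneCandidateFilter and Extract are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PhoneCandidateFilter
{
    // Broad pattern from the answer: optional "+CC (N) " prefix, then any run
    // of digits, spaces and hyphens. It over-matches on purpose (even a lone
    // space is a match), so the results must be filtered afterwards.
    private static readonly Regex Candidate = new Regex(
        @"(?:\+\d+\s+\(\d+\)\s+)?[\d -]+");

    // Assumed plausibility bounds for the digit count (not from the answer).
    private const int MinDigits = 7;
    private const int MaxDigits = 15;

    // Hypothetical helper: returns trimmed candidates that pass the digit-count filter.
    public static List<string> Extract(string text)
    {
        return Candidate.Matches(text)
            .Cast<Match>()
            // Drop the stray spaces and hyphens the character class picks up at the edges.
            .Select(m => m.Value.Trim(' ', '-'))
            .Where(v =>
            {
                int digits = v.Count(c => char.IsDigit(c));
                return digits >= MinDigits && digits <= MaxDigits;
            })
            .ToList();
    }

    public static void Main()
    {
        // Illustrative sample text, not taken from a real page.
        string text = "Call 08450 211 211 or +44 (0) 1246 733 000 today, room 12.";
        foreach (string phone in Extract(text))
        {
            Console.WriteLine(phone);
        }
    }
}
```

With the sample text, the raw matches include single spaces and the short "12", all of which the digit-count check discards. The bounds are easy to tune, and stricter checks (for example, validating the country code) could be added in the same Where clause.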

Upvotes: 1
