Reputation: 17794
I am using HtmlAgilityPack to parse web pages. Once the document is loaded, I want to extract possible phone numbers from the HTML. Currently I use a regex for this. The following code checks the page text for phone number matches:
private static string phoneReg =
@"[\+]{0,1}(\d{10,13}|[\(][\+]{0,1}\d{2,}[\13)]*\d{5,13}|\d{2,6}[\-]{1}\d{2,13}[\-]*\d{3,13})";
private static Regex phoneRegex = new Regex(phoneReg, RegexOptions.IgnoreCase);
var phoneMatches = phoneRegex.Matches(doci.DocumentNode.InnerText);
where doci is the HtmlDocument abstraction from Html Agility Pack. The problem is that it fails to match some phone numbers, such as 08450 211 211 and +44 (0) 1246 733 000.
Is there a generic regular expression that is suitable for crawling websites and matches most forms of international phone numbers?
Upvotes: 1
Views: 1605
Reputation: 29863
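You cannot match those phone numbers (08450 211 211 and +44 (0) 1246 733 000) because your regex has no provision for the spaces between digit groups: every alternative in it expects the digits to be contiguous or separated by hyphens.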
The first thing you have to do when writing a regular expression is to identify the pattern you want to match.
So, my suggestion is to write down a list of the different phone number formats you need, update your question, and then we will be able to help you. Otherwise I can always come up with a new phone number that your regex will not match, or the regex will match more than what you want.
Here is a regex that will match the above phone numbers:
(?:\+\d+\s+\(\d+\)\s+)?\d{4,5}\s+\d{3}\s+\d{3}
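As a rough illustration, the pattern above could be applied in C# along these lines. The class name StrictPhoneMatcher and the Find helper are made up for this sketch and are not part of HtmlAgilityPack; in the question's setup the input string would be doci.DocumentNode.InnerText.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class StrictPhoneMatcher
{
    // Optional "+<digits> (<digits>) " prefix, then digit groups of 4-5, 3 and 3,
    // each separated by whitespace.
    private static readonly Regex Pattern = new Regex(
        @"(?:\+\d+\s+\(\d+\)\s+)?\d{4,5}\s+\d{3}\s+\d{3}");

    // Returns every substring of text that matches the pattern, in order.
    public static List<string> Find(string text)
    {
        var results = new List<string>();
        foreach (Match m in Pattern.Matches(text))
        {
            results.Add(m.Value);
        }
        return results;
    }
}
```

Calling StrictPhoneMatcher.Find("Call 08450 211 211 or +44 (0) 1246 733 000 today.") should return both numbers from the question. Any number in a different layout (hyphens, other group sizes) will still be missed, which is why the list of target formats matters.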
Edit:
Based on your comment, I would just use this looser regex and then discard the matches that are not phone numbers:
(?:\+\d+\s+\(\d+\)\s+)?[\d -]+
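Here is a minimal sketch of that "match loosely, then filter" idea in C#. The filtering rule is an assumption of mine, not something the pattern gives you: it keeps candidates with 7 to 15 digits (15 is the E.164 maximum; the lower bound of 7 is an arbitrary guess you should tune). LoosePhoneMatcher and Find are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class LoosePhoneMatcher
{
    // Loose candidate pattern: optional "+<digits> (<digits>) " prefix,
    // then any run of digits, spaces and hyphens.
    private static readonly Regex Candidate = new Regex(
        @"(?:\+\d+\s+\(\d+\)\s+)?[\d -]+");

    // minDigits/maxDigits are assumed thresholds for the post-filter.
    public static List<string> Find(string text, int minDigits = 7, int maxDigits = 15)
    {
        var results = new List<string>();
        foreach (Match m in Candidate.Matches(text))
        {
            // The loose pattern also captures surrounding spaces and hyphens.
            string candidate = m.Value.Trim(' ', '-');

            // Discard runs that are too short or too long to be a phone number.
            int digits = candidate.Count(c => char.IsDigit(c));
            if (digits >= minDigits && digits <= maxDigits)
            {
                results.Add(candidate);
            }
        }
        return results;
    }
}
```

Because the loose pattern also matches lone spaces and short numbers such as "42", the digit-count check does most of the work here. Long numeric strings that are not phone numbers (order IDs, timestamps) can still slip through, so further filtering may be needed for real pages.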
Upvotes: 1