Drake

Reputation: 3891

What regular expression is good for extracting URLs from HTML?

I have tried writing my own and using the top-rated ones here on StackOverflow, but most of them match more than desired.

For instance, some would extract http://foo.com/hello?world<br (note the <br at the end) from the input ...http://foo.com/hello?world<br>....

Is there a pattern that can match just the URL more reliably?

This is the current pattern I am using:

@"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&^]*)"

Upvotes: 2

Views: 289

Answers (3)

Jason

Reputation: 3485

Your regex needs to escape the dash "-" in the last character class:

@"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+\-=\\\.&^]*)"

Essentially, the unescaped \+-= was read as a range allowing every character from + through =, which includes <.
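
To see the effect of the escape, here is a minimal sketch that runs both patterns against the sample input from the question. The class name DashRangeDemo and the helper FirstUrl are made up for illustration; only the two patterns come from this thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DashRangeDemo
{
    // Shared prefix of both patterns, copied from the question.
    private const string Prefix = @"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+";

    // Question's pattern: "\+-=" is the range '+' (U+002B) through '=' (U+003D), which contains '<' (U+003C).
    public const string Original = Prefix + @"[\w\d:#@%/;$()~_?\+-=\\\.&^]*)";

    // Pattern with the dash escaped: '+', '-' and '=' are now three separate literals.
    public const string Fixed = Prefix + @"[\w\d:#@%/;$()~_?\+\-=\\\.&^]*)";

    // Hypothetical helper: returns the first match of pattern in input, or null if there is none.
    public static string FirstUrl(string pattern, string input)
    {
        Match match = Regex.Match(input, pattern);
        return match.Success ? match.Value : null;
    }
}
```

Calling DashRangeDemo.FirstUrl with Original on ...http://foo.com/hello?world<br>... returns the URL with the trailing <br, while Fixed stops right before the <.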

Upvotes: 0

Samet S.

Reputation: 475

Try this:

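    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    // Returns every match of pattern in input. If groupName is non-empty,
    // the value of that named group is returned instead of the whole match.
    public static string[] Parse(string pattern, string groupName, string input)
    {
        var list = new List<string>();

        var regex = new Regex(pattern, RegexOptions.IgnoreCase);
        for (var match = regex.Match(input); match.Success; match = match.NextMatch())
        {
            list.Add(string.IsNullOrWhiteSpace(groupName) ? match.Value : match.Groups[groupName].Value);
        }

        return list.ToArray();
    }

    // Extracts URI-like strings from input. '<' is not in any character class,
    // so a match ends before a following "<br>".
    public static string[] ParseUri(string input)
    {
        const string pattern = @"(?<Protocol>\w+):\/\/(?<Domain>[\w@][\w.:@]+)\/?[\w\.?=%&\-@/$,]*";

        return Parse(pattern, string.Empty, input);
    }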

Upvotes: 0

Ritch Melton

Reputation: 11598

The most secure regex is to not use a regex at all and use the System.Uri class.

System.Uri

    Uri uri = new Uri("http://myUrl/%2E%2E/%2E%2E");
    Console.WriteLine(uri.AbsoluteUri);
    Console.WriteLine(uri.PathAndQuery);
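
Note that System.Uri parses a single string and does not search HTML for you. One way to combine the two ideas is to check each candidate string with Uri.TryCreate before accepting it. The sketch below assumes you only want http and https links; the class name UriCheck and the helper IsHttpUrl are hypothetical, not part of the framework.

```csharp
using System;

public static class UriCheck
{
    // Hypothetical helper: true when candidate parses as an absolute URI
    // whose scheme is http or https.
    public static bool IsHttpUrl(string candidate)
    {
        Uri uri;
        return Uri.TryCreate(candidate, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }
}
```

Unlike the Uri constructor, TryCreate returns false instead of throwing when the string cannot be parsed, which is convenient when filtering many candidates.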

Upvotes: 3
