Reputation: 1403

C# Regex bug with URL

I'm parsing a file of URL to get the host and URI part but there is a bug when the URL is not finished with a slash.

C# code :

var URL = Regex.Match(link, @"(?:.*?//)?(.*?)(/.*)", RegexOptions.IgnoreCase);

Input :

//cdn.sstatic.net/stackoverflow/img/favicon.ico
/opensearch.xml
http://stackoverflow.com/
http://careers.stackoverflow.com

Output :

//cdn.sstatic.net/stackoverflow/img/favicon.ico has 2 groups:
    cdn.sstatic.net
    /stackoverflow/img/favicon.ico

/opensearch.xml has 2 groups:

    /opensearch.xml

http://stackoverflow.com/ has 2 groups:
    stackoverflow.com
    /
http://careers.stackoverflow.com has 2 groups:
    http:
    //careers.stackoverflow.com

Every URL in the output is valid exept for : http://careers.stackoverflow.com. How can I check for a variable part like "if there is a slash, stop to the first one orelse grab everythings".

Upvotes: 3

Answers (3)

MC ND

Reputation: 70943

Just another option

/(?:.+\/\/|\/\/)?([^\/]*)(\/.+)?/

Upvotes: 1

Samuel Neff

Reputation: 74919

Add |$ to your last group, to match that text or match the end of the expression.

This works for your inputs:

var links = new[]
    {
        "//cdn.sstatic.net/stackoverflow/img/favicon.ico",
        "/opensearch.xml",
        "http://stackoverflow.com/",
        "http://careers.stackoverflow.com"
    };

foreach (string link in links)
{
    var u = Regex.Match(link, @"(?:.*?//)?(.*?)(/.*|$)", RegexOptions.IgnoreCase);
    Console.WriteLine(link);
    Console.WriteLine("    " + u.Groups[1]);
    Console.WriteLine("    " + u.Groups[2]);
    Console.WriteLine();
}

Output:

//cdn.sstatic.net/stackoverflow/img/favicon.ico
    cdn.sstatic.net
    /stackoverflow/img/favicon.ico

/opensearch.xml

    /opensearch.xml

http://stackoverflow.com/
    stackoverflow.com
    /

http://careers.stackoverflow.com
    careers.stackoverflow.com

Upvotes: 1

acfrancis

Reputation: 3681

usr is right that you should use the Uri class but if you insist on using Regex, try using the zero-width positive lookahead assertion like this:

var URL = Regex.Match(link, @"(?:.*?//)?(.*?(?=/|$))(/.*)", RegexOptions.IgnoreCase);

More details at:

http://msdn.microsoft.com/en-us/library/bs2twtah.aspx#zerowidth_positive_lookahead_assertion

Upvotes: -1

C# Regex bug with URL

Answers (3)

Related Questions