Reputation: 1403
I'm parsing a file of URLs to get the host and URI parts, but there is a bug when the URL does not end with a slash.
C# code:
var URL = Regex.Match(link, @"(?:.*?//)?(.*?)(/.*)", RegexOptions.IgnoreCase);
Input:
//cdn.sstatic.net/stackoverflow/img/favicon.ico
/opensearch.xml
http://stackoverflow.com/
http://careers.stackoverflow.com
Output:
//cdn.sstatic.net/stackoverflow/img/favicon.ico has 2 groups:
cdn.sstatic.net
/stackoverflow/img/favicon.ico
/opensearch.xml has 2 groups:
/opensearch.xml
http://stackoverflow.com/ has 2 groups:
stackoverflow.com
/
http://careers.stackoverflow.com has 2 groups:
http:
//careers.stackoverflow.com
Every URL in the output is parsed correctly except for http://careers.stackoverflow.com. How can I make part of the pattern conditional, along the lines of "if there is a slash, stop at the first one; otherwise grab everything"?
Upvotes: 3
Views: 168
Reputation: 74919
Add |$ to your last group, so that it matches either that text or the end of the input.
This works for your inputs:
var links = new[]
{
    "//cdn.sstatic.net/stackoverflow/img/favicon.ico",
    "/opensearch.xml",
    "http://stackoverflow.com/",
    "http://careers.stackoverflow.com"
};
foreach (string link in links)
{
    var u = Regex.Match(link, @"(?:.*?//)?(.*?)(/.*|$)", RegexOptions.IgnoreCase);
    Console.WriteLine(link);
    Console.WriteLine(" " + u.Groups[1]);
    Console.WriteLine(" " + u.Groups[2]);
    Console.WriteLine();
}
Output:
//cdn.sstatic.net/stackoverflow/img/favicon.ico
cdn.sstatic.net
/stackoverflow/img/favicon.ico
/opensearch.xml
/opensearch.xml
http://stackoverflow.com/
stackoverflow.com
/
http://careers.stackoverflow.com
careers.stackoverflow.com
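If the numbered groups get hard to follow, the same alternation can be wrapped in a small helper with named groups. This is only a sketch: the names LinkSplitter, Split, host and path are made up for illustration, and the tuple return type assumes C# 7 or later.

```csharp
using System;
using System.Text.RegularExpressions;

static class LinkSplitter
{
    // Same pattern as above, with named groups instead of numbered ones.
    // The "path" group matches either "/..." or the empty string at the end of the input.
    private static readonly Regex Pattern =
        new Regex(@"(?:.*?//)?(?<host>.*?)(?<path>/.*|$)");

    // Returns the host and path parts of a link; either part may be empty.
    public static (string Host, string Path) Split(string link)
    {
        Match m = Pattern.Match(link);
        return (m.Groups["host"].Value, m.Groups["path"].Value);
    }
}
```

For a link with no slash after the host, Path comes back as an empty string rather than a failed match, so callers can decide for themselves whether to treat it as "/".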
Upvotes: 1
Reputation: 3681
usr is right that you should use the Uri class, but if you insist on using Regex, try a zero-width positive lookahead assertion and make the trailing path group optional, like this:
var URL = Regex.Match(link, @"(?:.*?//)?(.*?(?=/|$))(/.*)?", RegexOptions.IgnoreCase);
More details at:
http://msdn.microsoft.com/en-us/library/bs2twtah.aspx#zerowidth_positive_lookahead_assertion
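For comparison, here is a rough sketch of the Uri route. The helper name UriSplitter.TrySplit and the baseUri parameter are assumptions, not something from the question: relative links such as /opensearch.xml need the address of the page they came from before they can be resolved. How the relative inputs resolve may also depend on the runtime and platform, so test them before relying on this.

```csharp
using System;

static class UriSplitter
{
    // baseUri is an assumption: it stands for the page the links were found on.
    // Absolute links ignore it; relative links are resolved against it.
    public static bool TrySplit(Uri baseUri, string link, out string host, out string path)
    {
        host = string.Empty;
        path = string.Empty;

        // Uri.TryCreate returns false instead of throwing on malformed input.
        if (!Uri.TryCreate(baseUri, link, out Uri resolved))
            return false;

        host = resolved.Host;          // e.g. "careers.stackoverflow.com"
        path = resolved.AbsolutePath;  // "/" when the URL has no path
        return true;
    }
}
```

Unlike the regex, this reports "/" as the path for http://careers.stackoverflow.com, because Uri normalizes an empty path on an http URL.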
Upvotes: -1