Reputation: 3891
I have tried my own URL-matching patterns and the top-voted ones here on StackOverflow, but most of them match more than desired.
For instance, some would extract http://foo.com/hello?world<br (note the <br at the end) from the input ...http://foo.com/hello?world<br>...
Is there a pattern that can match just the URL more reliably?
This is the current pattern I am using:
@"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&^]*)"
Upvotes: 2
Views: 289
Reputation: 3485
Your regex needs an escape for the dash "-" in the last character group:
@"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+\-=\\\.&^]*)"
Essentially, the unescaped \+-= was read as a range running from + (U+002B) through = (U+003D), and that range includes < (U+003C).
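To make the difference concrete, here is a minimal sketch that runs both patterns from this thread against the sample input from the question. The class name DashRangeDemo and the helper FirstMatch are made up for illustration; the two pattern strings are copied verbatim from the question and from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DashRangeDemo
{
    // Pattern from the question: "\+-=" inside the class is a range from '+' to '='.
    public const string Unescaped =
        @"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&^]*)";

    // Pattern from this answer: "\-" makes the dash a literal character.
    public const string Escaped =
        @"((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+\-=\\\.&^]*)";

    // Hypothetical helper: returns the text of the first match, or "" if none.
    public static string FirstMatch(string pattern, string input)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        const string input = "...http://foo.com/hello?world<br>...";

        // '<' (U+003C) falls inside the '+'..'=' range, so "<br" is swallowed.
        Console.WriteLine(FirstMatch(Unescaped, input)); // http://foo.com/hello?world<br

        // With the dash escaped, '<' is no longer in the class and the match stops there.
        Console.WriteLine(FirstMatch(Escaped, input));   // http://foo.com/hello?world
    }
}
```

The match stops before > in both cases because > (U+003E) sits just past the end of the range.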
Upvotes: 0
Reputation: 475
Try this:
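// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;

// Returns every match of pattern in input. If groupName is empty, the whole
// match is returned; otherwise only the value of the named group.
public static string[] Parse(string pattern, string groupName, string input)
{
    var list = new List<string>();
    var regex = new Regex(pattern, RegexOptions.IgnoreCase);
    for (var match = regex.Match(input); match.Success; match = match.NextMatch())
    {
        list.Add(string.IsNullOrWhiteSpace(groupName) ? match.Value : match.Groups[groupName].Value);
    }
    return list.ToArray();
}

// Extracts all URI-like substrings (protocol://domain/path?query) from input.
public static string[] ParseUri(string input)
{
    const string pattern = @"(?<Protocol>\w+):\/\/(?<Domain>[\w@][\w.:@]+)\/?[\w\.?=%&=\-@/$,]*";
    return Parse(pattern, string.Empty, input);
}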
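As a usage sketch, the snippet below wraps the same Parse logic and the same pattern in a small class and shows what the named groups return. The class name NamedGroupDemo and the sample input are assumptions made for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Pattern copied verbatim from ParseUri in this answer.
    public const string Pattern =
        @"(?<Protocol>\w+):\/\/(?<Domain>[\w@][\w.:@]+)\/?[\w\.?=%&=\-@/$,]*";

    // Same logic as Parse in this answer: an empty groupName yields the whole
    // match, any other value yields that named group.
    public static string[] Parse(string pattern, string groupName, string input)
    {
        var list = new List<string>();
        var regex = new Regex(pattern, RegexOptions.IgnoreCase);
        for (var match = regex.Match(input); match.Success; match = match.NextMatch())
        {
            list.Add(string.IsNullOrWhiteSpace(groupName) ? match.Value : match.Groups[groupName].Value);
        }
        return list.ToArray();
    }

    public static void Main()
    {
        // Hypothetical sample input, modeled on the one in the question.
        const string input = "See http://foo.com/hello?world<br>for details";

        Console.WriteLine(string.Join(", ", Parse(Pattern, string.Empty, input))); // http://foo.com/hello?world
        Console.WriteLine(string.Join(", ", Parse(Pattern, "Protocol", input)));   // http
        Console.WriteLine(string.Join(", ", Parse(Pattern, "Domain", input)));     // foo.com
    }
}
```

Note that the Domain group only allows word characters plus . : and @, so a host name containing a hyphen would be cut short; the pattern would need adjusting for such inputs.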
Upvotes: 0
Reputation: 11598
The most secure regex is to not use a regex at all and use the System.Uri class.
Uri uri = new Uri("http://myUrl/%2E%2E/%2E%2E");
Console.WriteLine(uri.AbsoluteUri);
Console.WriteLine(uri.PathAndQuery);
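For pulling URLs out of larger text, one possible way to apply this idea is to split the input on characters that cannot appear unencoded in a URL and let Uri.TryCreate decide which tokens are real URIs. This is only a sketch under assumptions: the class UriExtractor, the method ExtractUris, the delimiter list, and the set of accepted schemes are all illustrative choices, not something prescribed by System.Uri.

```csharp
using System;
using System.Collections.Generic;

public static class UriExtractor
{
    // Assumed delimiters: whitespace plus a few characters that are not
    // allowed unencoded in a URL. Adjust for the text being processed.
    private static readonly char[] Delimiters = { ' ', '\t', '\r', '\n', '<', '>', '"' };

    // Hypothetical helper: returns every token that parses as an absolute
    // http, https, or ftp URI.
    public static List<Uri> ExtractUris(string input)
    {
        var result = new List<Uri>();
        foreach (var token in input.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries))
        {
            Uri uri;
            if (Uri.TryCreate(token, UriKind.Absolute, out uri)
                && (uri.Scheme == Uri.UriSchemeHttp
                    || uri.Scheme == Uri.UriSchemeHttps
                    || uri.Scheme == Uri.UriSchemeFtp))
            {
                result.Add(uri);
            }
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical sample input, modeled on the one in the question.
        const string input = "See http://foo.com/hello?world<br>for details";
        foreach (var uri in ExtractUris(input))
        {
            Console.WriteLine(uri.AbsoluteUri); // http://foo.com/hello?world
        }
    }
}
```

A token with other text glued to the front of the URL (such as the leading dots in ...http://foo.com/) will be rejected rather than trimmed, so this approach works best when URLs are already separated from the surrounding text by one of the delimiters.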
Upvotes: 3