Reputation: 365
In my C# program I wrote a Google Search Function, which works by fetching the source from each page and getting the URLs via regex.
My current regex is:
(?:(?:(?:http)://)(?:w{3}\\.)?(?:[a-zA-Z0-9/;\\?&=:\\-_\\$\\+!\\*'\\(\\|\\\\~\\[\\]#%\\.])+)
This works well at the moment, but I get URLs like http://www.example.com/forums/arcade.php?efdf=332, for example. In this case I just want the URL without the ?efdf=332 at the end.
So how should I change the regex?
Upvotes: 1
Views: 539
Reputation: 244757
You can use the Uri class to access various parts of the URL and either remove the query string from the end, or concatenate the parts you want.
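As a minimal sketch of the first option, assuming the matched string is an absolute URL: Uri.GetLeftPart(UriPartial.Path) returns the scheme, host, and path, which drops the query string (and any fragment). The UrlCleaner class and StripQuery method are hypothetical names used only for illustration.

```csharp
using System;

public static class UrlCleaner
{
    // Hypothetical helper: returns the URL up to and including the path,
    // without the query string or fragment. Returns null when the input
    // cannot be parsed as an absolute URL.
    public static string StripQuery(string url)
    {
        if (!Uri.TryCreate(url, UriKind.Absolute, out Uri uri))
        {
            return null;
        }

        // UriPartial.Path covers scheme + authority + path.
        return uri.GetLeftPart(UriPartial.Path);
    }
}
```

Calling UrlCleaner.StripQuery("http://www.example.com/forums/arcade.php?efdf=332") should give http://www.example.com/forums/arcade.php. If you would rather build the result yourself, the Scheme, Host, and AbsolutePath properties expose the individual parts.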
Upvotes: 0
Reputation: 336078
http://(?:www\.)?[a-zA-Z0-9/;&=:_$+!*'()|~\[\]#%.\\-]+
does the same as your regex (I've removed a lot of unnecessary cruft) but stops matching a link before a ?.
In C#:
Regex regexObj = new Regex(@"http://(?:www\.)?[a-zA-Z0-9/;&=:_$+!*'()|~\[\]#%.\\-]+");
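For context, here is a rough sketch of how that regex could be applied to a fetched page. The LinkExtractor class, the ExtractUrls method, and the pageSource parameter are made-up names for illustration, not part of the question's code.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Same pattern as above. '?' is not in the character class,
    // so each match ends right before the query string.
    private static readonly Regex UrlRegex = new Regex(
        @"http://(?:www\.)?[a-zA-Z0-9/;&=:_$+!*'()|~\[\]#%.\\-]+");

    // Hypothetical helper: collects every match found in the page source.
    public static List<string> ExtractUrls(string pageSource)
    {
        var urls = new List<string>();
        foreach (Match match in UrlRegex.Matches(pageSource))
        {
            urls.Add(match.Value);
        }
        return urls;
    }
}
```

With input such as `<a href="http://www.example.com/forums/arcade.php?efdf=332">`, the single match should be http://www.example.com/forums/arcade.php.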
That said, I'm not sure this is such a good way of matching URLs (what about https, ftp, mailto, etc.?)
Upvotes: 2