Omegavirus

Reputation: 365

Regex for URL C#

In my C# program I wrote a Google search function, which works by fetching the source of each page and extracting the URLs with a regex.

My current regex is:

(?:(?:(?:http)://)(?:w{3}\\.)?(?:[a-zA-Z0-9/;\\?&=:\\-_\\$\\+!\\*'\\(\\|\\\\~\\[\\]#%\\.])+)

This works well at the moment, but I get URLs like http://www.example.com/forums/arcade.php?efdf=332, for example.

In this case I just want the URL without the ?efdf=332 at the end.

So how should I change the regex?

Upvotes: 1

Views: 539

Answers (2)

svick

Reputation: 244757

You can use the Uri class to access various parts of the URL and either remove the query string from the end, or concatenate the parts you want.
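As a minimal sketch of that idea, `Uri.GetLeftPart(UriPartial.Path)` returns the scheme, host and path without the query string. The class and method names (`UrlTrimmer`, `StripQuery`) are made up for illustration, and the input is assumed to be an absolute URL.

```csharp
using System;

public static class UrlTrimmer
{
    // Hypothetical helper: returns scheme + authority + path,
    // dropping the query string and any fragment.
    // Assumes 'url' is an absolute URL; new Uri(...) throws otherwise.
    public static string StripQuery(string url)
    {
        var uri = new Uri(url);
        return uri.GetLeftPart(UriPartial.Path);
    }
}
```

For the URL from the question, `UrlTrimmer.StripQuery("http://www.example.com/forums/arcade.php?efdf=332")` gives `http://www.example.com/forums/arcade.php`.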

Upvotes: 0

Tim Pietzcker

Reputation: 336078

http://(?:www\.)?[a-zA-Z0-9/;&=:_$+!*'()|~\[\]#%.\\-]+

does the same as your regex (I've removed a lot of unnecessary cruft) but stops matching a link before a ?.

In C#:

Regex regexObj = new Regex(@"http://(?:www\.)?[a-zA-Z0-9/;&=:_$+!*'()|~\[\]#%.\\-]+");
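To show how this could be applied to a fetched page, here is a rough sketch that collects every match from the page source. `LinkExtractor` and `Extract` are hypothetical names, not something from the question's code.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Same pattern as above; '?' is not in the character class,
    // so each match ends before the query string.
    private static readonly Regex UrlRegex =
        new Regex(@"http://(?:www\.)?[a-zA-Z0-9/;&=:_$+!*'()|~\[\]#%.\\-]+");

    // Hypothetical helper: returns all matched URLs in document order.
    public static List<string> Extract(string pageSource)
    {
        var urls = new List<string>();
        foreach (Match m in UrlRegex.Matches(pageSource))
        {
            urls.Add(m.Value);
        }
        return urls;
    }
}
```

Called on a snippet such as `<a href="http://www.example.com/forums/arcade.php?efdf=332">`, this should return a single entry, `http://www.example.com/forums/arcade.php`.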

That said, I'm not sure this is such a good way of matching URLs (what about https, ftp, mailto etc.?)
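If other schemes matter, one possible tweak (just a sketch, not a complete URL matcher) is to widen the prefix to an alternation. The version below assumes only https and ftp are wanted in addition to http; mailto is left out because it has no `//` part. `SchemeAwareExtractor` is a hypothetical name.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SchemeAwareExtractor
{
    // Assumption: http, https and ftp are the only schemes of interest.
    // The character class is unchanged, so matches still stop before '?'.
    private static readonly Regex UrlRegex =
        new Regex(@"(?:https?|ftp)://(?:www\.)?[a-zA-Z0-9/;&=:_$+!*'()|~\[\]#%.\\-]+");

    // Hypothetical helper: returns all matched URLs in document order.
    public static List<string> Extract(string text)
    {
        var urls = new List<string>();
        foreach (Match m in UrlRegex.Matches(text))
        {
            urls.Add(m.Value);
        }
        return urls;
    }
}
```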

Upvotes: 2
