Patrick Gates

Reputation: 391

URL regex issues

I'm using this regex to search for URLs:

(((ht|f)tp(s?))\://)?(www.|[a-zA-Z].)[a-zA-Z0-9\-\.]+\.(com|edu|gov|mil|net|org|biz|info|name|museum|us|ca|uk)(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\;\?\'\\\+&%\$#\=~_\-]+))*

The only problem is that it treats "you ca" as a URL. How do I change it so there HAS to be a period before the ending (in this case the 'ca'), so that 'you ca' no longer matches but 'you.ca' does?

Upvotes: 1

Views: 164

Answers (5)

jonesy

Reputation: 3542

You can put a quantifier on the period character, so '\.{1}' requires exactly one literal period before whatever follows. It matches the same text as a plain '\.'.

It isn't strictly needed to debug this problem, but it may help to know about it. It's more explicit, and '{1}' is more visible than a bare dot, so it also acts as a separator in long, ugly regexes where, while debugging, you might accidentally put a "+" or "*" next to the dot.
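As a rough illustration of the quantifier form, here is a minimal Python sketch. Python is an assumption (the thread doesn't say which language is in use), and the cut-down patterns `explicit` and `plain` are hypothetical stand-ins for the full regex. It suggests that '\.{1}' and '\.' accept and reject the same inputs.

```python
import re

# Hypothetical, cut-down patterns: some letters, one literal period, then "ca".
explicit = re.compile(r"[a-zA-Z]+\.{1}ca")  # period with an explicit {1} quantifier
plain = re.compile(r"[a-zA-Z]+\.ca")        # plain escaped period

for text in ("you.ca", "you ca"):
    # Both patterns should agree on every input.
    print(text, bool(explicit.fullmatch(text)), bool(plain.fullmatch(text)))
```

Both patterns accept 'you.ca' and reject 'you ca', so the quantifier is a readability choice rather than a change in behavior.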

Upvotes: 0

slebetman

Reputation: 113866

John Gruber's regexp is the best I've come across so far for finding URLs. See the article on his blog, "An Improved Liberal, Accurate Regex Pattern for Matching URLs". It's in use in lots of production code. There are two versions: one matches any URL, while the other matches only http/https URLs.

Upvotes: 0

Norbert de Langen

Reputation: 181

I use a freeware tool to check my regexes: http://www.weitz.de/regex-coach/

Perhaps it will be helpful to you.

Upvotes: 0

szbalint

Reputation: 1633

Parsing URIs with regexes is a hard problem.

Either use a library like Regexp::Common::URI or prepare to spend a lot of time reading a bunch of RFCs. Parsing URIs is far from trivial, and there are lots of subtle mistakes to be made.
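To give a feel for the library route, here is a small Python sketch. Regexp::Common::URI is a Perl module; Python's standard `urllib.parse` is swapped in here as a roughly analogous tool, which is an assumption about what would suit the asker. The helper name `extract_urls`, the whitespace tokenizing, and the list of accepted schemes are all hypothetical choices, and this approach only picks up candidates that carry an explicit scheme.

```python
from urllib.parse import urlsplit

def extract_urls(text):
    """Return whitespace-separated tokens that parse as http/https/ftp URLs."""
    urls = []
    for token in text.split():
        try:
            parts = urlsplit(token)
        except ValueError:
            # urlsplit rejects some malformed inputs, e.g. a broken IPv6 host.
            continue
        # Require both a known scheme and a network location (host).
        if parts.scheme in ("http", "https", "ftp") and parts.netloc:
            urls.append(token)
    return urls

print(extract_urls("see http://you.ca/page or you ca"))
```

Bare words such as 'you' and 'ca' have no scheme or host, so they are skipped without any hand-written TLD list.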

Upvotes: 3

zigdon

Reputation: 15063

You forgot to escape the periods in the (www.|[a-zA-Z].) block.
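The effect of an unescaped period can be sketched in Python (an assumption, since the thread doesn't name a language). The patterns below are hypothetical cut-down fragments of the posted regex. The second pair shows what would happen if the period before the TLD list also ended up unescaped, for example if a string literal swallowed the backslash; whether that is what happens in the asker's code is a guess, not something the thread confirms.

```python
import re

# An unescaped "." matches any single character (except a newline by default).
loose_prefix = re.compile(r"www.")
strict_prefix = re.compile(r"www\.")

print(bool(loose_prefix.fullmatch("wwwx")))   # True: "." accepted the "x"
print(bool(strict_prefix.fullmatch("wwwx")))  # False: a literal period is required

TLDS = r"(com|edu|gov|mil|net|org|biz|info|name|museum|us|ca|uk)"
loose_tld = re.compile(r"[a-zA-Z0-9\-\.]+." + TLDS)    # period before TLD unescaped
strict_tld = re.compile(r"[a-zA-Z0-9\-\.]+\." + TLDS)  # period before TLD escaped

print(bool(loose_tld.search("you ca")))    # True: "." accepted the space
print(bool(strict_tld.search("you ca")))   # False
print(bool(strict_tld.search("you.ca")))   # True
```

With the period escaped, 'you ca' is rejected and 'you.ca' still matches, which is the behavior the question asks for.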

Upvotes: 1
