tsilb
tsilb

Reputation: 8037

This regex matches and shouldn't. Why is it?

This regex:

^((https?|ftp)\:(\/\/)|(file\:\/{2,3}))?(((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3} (25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(((([a-zA-Z0-9]+)(\.)?)+?)(\.)([a-z]{2} |com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum))([a-zA-Z0-9\?\=\&\%\/]*)?$

Formatted for readability:

^( # Begin regex / begin address clause
  (https?|ftp)\:(\/\/)|(file\:\/{2,3}))? # protocol
  ( # container for two address formats, more to come later
   ((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
   (25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?) # match IP addresses
  )|( # delimiter for address formats
   ((([a-zA-Z0-9]+)(\.)?)+?) # match domains and any number of subdomains
   (\.) #dot for .com
   ([a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum) #TLD clause
  ) # end address clause
([a-zA-Z0-9\?\=\&\%\/]*)? # querystring support, will pretty this up later
$

is matching:

www.google

and shouldn't be. This is one of my "fail" test cases. I have declared the TLD portion of the URL to be mandatory when matching on alpha instead of on IP, and "google" doesn't fit into the "[a-z]{2}" clause.

Keep in mind I will fix the following issues seperately - this question is about why it matches www.google and shouldn't.

What's wrong with my regex?

edit: See also a previous problem with an earlier version of this regex on a different test case: How can I make this regex match correctly?



edit2: Fixed - The corrected regex (as asked) is:

^((https?|ftp)\:(\/\/)|(file\:\/{2,3}))?(((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3} (25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(((([a-zA-Z0-9]+)(\.)?)+?)(\.)([a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum))([\/][\/a-zA-Z0-9\.]*)*?([\/]?[\?][a-zA-Z0-9\=\&\%\/]*)?$

Upvotes: 5

Views: 374

Answers (3)

AnthonyWJones
AnthonyWJones

Reputation: 189495

google" doesn't fit into the "[a-z]{2}" clause.

But "go" does and then "ogle" matches "([a-zA-Z0-9\?\=\&\%/]*)?"

Upvotes: 2

Kaivosukeltaja
Kaivosukeltaja

Reputation: 15735

Your TLD clause matches "go" in google and the querystring support part matches "ogle" afterwards. Try changing the querystring part to this:

([?/][a-zA-Z0-9\?\=\&\%\/]*)?

Upvotes: 2

Amber
Amber

Reputation: 527023

"google" might not fit in [a-z]{2}, but it does fit in [a-z]{2}([a-zA-Z0-9\?\=\&\%\/]*)? - you forgot to require a / after the TLD if the URL extends beyond the domain. So it's interpreting it with "www.go" as the domain and then "ogle" following it, with no slash in between. You can fix it by adding a [?/] to the front of that last group to require one of those two symbols between the TLD and any further portion of the URL.

Upvotes: 12

Related Questions