Reputation: 22515
How do I match a URL that matches all of these conditions:
example.com/FIRST_URL_TOKEN
example.com/FIRST_URL_TOKEN/SUBSEQUENT_URL_TOKEN/SUBSEQUENT_URL_TOKEN
So:
http://example.com/test should match
http://blog.example.com/test should not match
http://example.com/test/blog/test should not match
http://example.com/test/test2 should match
Here is what I have so far:
regex = /^http(s)?:\/\/(?!blog\.$)example.com(\.\w+)?\/(?!news$|archive$|blog$).*/
However, I'm missing something, as http://example.com/test/blog/test should not match.
Upvotes: 0
Views: 60
Reputation: 160551
Rather than use a complex regex, which will usually grow even more complex and difficult to manage over time, I'd recommend writing a simple method that breaks the test down into smaller parts and returns true or false depending on whether the URL is valid/usable.
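require 'uri'

def match_uri(url)
  uri = URI.parse(url)

  # Reject other hosts (e.g. blog.example.com), excluded first tokens,
  # and any path that contains a "blog" token.
  if uri.host != 'example.com' ||
     uri.path[%r!^/(?:news|archives|blog)/!i] ||
     uri.path[%r!/blog/!i]
    return false
  end

  true
end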
# 'http://example.com/test' should match
match_uri('http://example.com/test') # => true
# 'http://blog.example.com/test' should not match
match_uri('http://blog.example.com/test') # => false
# 'http://example.com/test/blog/test' should not match
match_uri('http://example.com/test/blog/test') # => false
# 'http://example.com/test/test2' should match
match_uri('http://example.com/test/test2') # => true
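The same idea of breaking the test into smaller parts can be taken one step further by comparing path tokens directly instead of using regexes. This is only a sketch under the assumption that the excluded first tokens are news, archives and blog; the names allowed_url? and EXCLUDED_FIRST_TOKENS are made up for illustration:

```ruby
require 'uri'

# Hypothetical names, for illustration only.
EXCLUDED_FIRST_TOKENS = %w[news archives blog].freeze

def allowed_url?(url)
  uri = URI.parse(url)
  return false unless uri.host == 'example.com'

  # "/test/blog/test" => ["test", "blog", "test"]
  tokens = uri.path.split('/').reject(&:empty?).map(&:downcase)
  return false if EXCLUDED_FIRST_TOKENS.include?(tokens.first)

  # No token anywhere in the path may be "blog".
  !tokens.include?('blog')
end

allowed_url?('http://example.com/test')           # => true
allowed_url?('http://blog.example.com/test')      # => false
allowed_url?('http://example.com/test/blog/test') # => false
allowed_url?('http://example.com/test/test2')     # => true
```

Comparing whole tokens means a path such as /test/blogger is not rejected by accident, which a bare /blog substring test would do.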
Here's what URI is returning:
uri = URI.parse('http://example.com/path/to/file')
uri.host # => "example.com"
uri.path # => "/path/to/file"
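Two edge cases may be worth knowing about before relying on uri.host: a string without a scheme has no host as far as URI is concerned, and malformed input raises URI::InvalidURIError. A small sketch; safe_parse is a hypothetical helper name, not part of the URI library:

```ruby
require 'uri'

# Without a scheme, URI treats the whole string as a path, so host is nil.
bare = URI.parse('example.com/test')
bare.host # => nil
bare.path # => "example.com/test"

# Hypothetical helper: returns nil instead of raising on unparsable input.
def safe_parse(url)
  URI.parse(url)
rescue URI::InvalidURIError
  nil
end

safe_parse('http://exa mple.com/test')      # => nil
safe_parse('http://example.com/test').host  # => "example.com"
```

With the host being nil for scheme-less input, the uri.host != 'example.com' test rejects such strings rather than matching them.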
The only problem I see with the logic you're using is that "path/to/file" could actually be "path/to/blog.ext", which would cause the logic to break. If that's possible, using:
File.dirname(uri.path) # => "/path/to"
will strip the filename off so the test only looks at the true path, not the path and file:
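def match_uri(url)
  uri = URI.parse(url)
  uri_dir = File.dirname(uri.path) # directory part only, filename stripped

  # (?:/|$) keeps the match on whole tokens, so "/newsletter" or
  # "/blogger" directories are not rejected by accident.
  if uri.host != 'example.com' ||
     uri_dir[%r!^/(?:news|archives|blog)(?:/|$)!i] ||
     uri_dir[%r!/blog(?:/|$)!i]
    return false
  end

  true
end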
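To make the effect of File.dirname concrete, here is what it returns for a few sample paths (the paths are just illustrations):

```ruby
File.dirname('/path/to/blog.ext') # => "/path/to"
File.dirname('/test/blog/test')   # => "/test/blog"
File.dirname('/test')             # => "/"
File.dirname('/blog')             # => "/"
```

Note that the last path token is always treated as the filename, so a single-token path such as /blog is reduced to "/" and is no longer seen by the first-token test. Whether that is acceptable depends on how the site's URLs are structured.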
"Regular Expressions: Now You Have Two Problems" is a good read.
Upvotes: 1
Reputation: 36101
%r{^https?://[^/]*(?<!blog\.)example\.com/(?!news/|archives/|blog/)(?!.*/blog(/|$)).*}
$ doesn't mean what I think you think it means, and you were not excluding blog/. It anchors the end of the line, so (?!blog$) only rejects "blog" when nothing follows it.
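The $ point can be checked with a cut-down pattern. In this sketch first_token is a simplified stand-in for the lookahead in the question (not the full original regex), and pattern is the regex given at the top of this answer, run against the four URLs from the question:

```ruby
# Simplified stand-in for the question's first-token lookahead.
first_token = %r{\Aexample\.com/(?!blog$).*}

first_token.match?('example.com/blog')           # => false
first_token.match?('example.com/blog/test')      # => true, "blog" is not at the end
first_token.match?('example.com/test/blog/test') # => true

# The regex from this answer.
pattern = %r{^https?://[^/]*(?<!blog\.)example\.com/(?!news/|archives/|blog/)(?!.*/blog(/|$)).*}

pattern.match?('http://example.com/test')           # => true
pattern.match?('http://blog.example.com/test')      # => false
pattern.match?('http://example.com/test/blog/test') # => false
pattern.match?('http://example.com/test/test2')     # => true
```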
So here is a breakdown:
%r{} - use it if you are going to escape forward slashes a lot
^ - from the start
https?:// - http:// or https://
[^/]* - zero or more characters which are not forward slashes
(?<!blog\.) - negative lookbehind to ensure the subdomain was not blog.example.com
example\.com - the example.com domain itself
/(?!news/|archives/|blog/) - after the first slash, the "url token" is not news or archives or blog
(?!.*/blog(/|$)) - none of the further "url tokens" is blog
.* - match the remaining characters
Upvotes: 2