Henley Wing Chiu

Reputation: 22515

How to get regex of URL that does not have a word as a token?

How do I match a URL that satisfies all of these conditions:

So:

http://example.com/test should match

http://blog.example.com/test should not match

http://example.com/test/blog/test should not match

http://example.com/test/test2 should match

Here is what I have so far:

regex = /^http(s)?:\/\/(?!blog\.$)example.com(\.\w+)?\/(?!news$|archive$|blog$).*/

However, I'm missing something as http://example.com/test/blog/test should not match.

Upvotes: 0

Views: 60

Answers (2)

the Tin Man

Reputation: 160551

Rather than use a complex regex, which will usually grow even more complex and harder to maintain over time, I'd recommend writing a simple method that breaks the test down into smaller parts and returns true or false depending on whether the URL is usable.

require 'uri'

def match_uri(url)
  uri = URI.parse(url)

  if uri.host != 'example.com' ||
    uri.path[%r!^/(?:news|archives|blog)/!i] ||
    uri.path[%r!/blog/!i]
    return false
  end

  true
end


# 'http://example.com/test' should match
match_uri('http://example.com/test') # => true

# 'http://blog.example.com/test' should not match
match_uri('http://blog.example.com/test') # => false

# 'http://example.com/test/blog/test' should not match
match_uri('http://example.com/test/blog/test') # => false

# 'http://example.com/test/test2' should match
match_uri('http://example.com/test/test2') # => true

Here's what URI returns:

uri = URI.parse('http://example.com/path/to/file')
uri.host # => "example.com"
uri.path # => "/path/to/file"

The only problem I see with the logic you're using is that "path/to/file" could actually be "path/to/blog.ext", which would cause the logic to break. If that's possible, using:

File.dirname(uri.path) # => "/path/to"

will strip the filename off so the test only looks at the true path, not the path and file:

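def match_uri(url)
  uri = URI.parse(url)

  # Directory part only, e.g. "/path/to" for "/path/to/file"
  uri_dir = File.dirname(uri.path)

  # (?:/|$) keeps the match on whole tokens, so "/blogger" is not rejected
  if uri.host != 'example.com' ||
    uri_dir[%r!^/(?:news|archives|blog)(?:/|$)!i] ||
    uri_dir[%r!/blog(?:/|$)!i]
    return false
  end

  true
end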
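As an alternative to running regexes over the path string, here is a minimal sketch that splits the path into tokens and compares whole tokens. The names `allowed_url?`, `BLOCKED_FIRST_TOKENS` and `BLOCKED_ANY_TOKENS` are made up for illustration, and treating unparsable URLs as non-matching is an assumption, not something the question specifies.

```ruby
require 'uri'

# Hypothetical names, chosen for this sketch only.
BLOCKED_FIRST_TOKENS = %w[news archives blog].freeze # not allowed as the first token
BLOCKED_ANY_TOKENS   = %w[blog].freeze               # not allowed anywhere in the path

def allowed_url?(url)
  uri = URI.parse(url)
  return false unless uri.host == 'example.com'

  # "/test/blog/test" becomes ["test", "blog", "test"]
  tokens = uri.path.split('/').reject(&:empty?).map(&:downcase)

  return false if BLOCKED_FIRST_TOKENS.include?(tokens.first)
  return false if tokens.any? { |token| BLOCKED_ANY_TOKENS.include?(token) }

  true
rescue URI::InvalidURIError
  # Assumption: a URL that cannot be parsed is treated as "no match".
  false
end

allowed_url?('http://example.com/test')           # => true
allowed_url?('http://blog.example.com/test')      # => false
allowed_url?('http://example.com/test/blog/test') # => false
allowed_url?('http://example.com/test/test2')     # => true
```

Because whole tokens are compared, a file such as "blog.ext" or a directory such as "blogger" is not mistaken for the "blog" token.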

"Regular Expressions: Now You Have Two Problems" is a good read.

Upvotes: 1

ndnenkov

Reputation: 36101

%r{^https?://[^/]*(?<!blog\.)example\.com/(?!news/|archives/|blog/)(?!.*/blog(/|$)).*}

See it in action


There were quite a few problems with your original regex. Mainly, $ doesn't mean what I think you think it means (it anchors the end of the line, not the end of a token), and you were not excluding blog/.
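To illustrate the point about $, here is a small sketch with two simplified stand-ins for the path part of the pattern. `END_ANCHORED` and `TOKEN_AWARE` are illustrative names, not part of either regex in this thread.

```ruby
# "blog$" inside the lookahead only rejects when "blog" is the last thing on the line.
END_ANCHORED = %r{\A/(?!blog$).*}

# "blog(?:/|$)" rejects "blog" as a whole first token, with or without a trailing path.
TOKEN_AWARE = %r{\A/(?!blog(?:/|$)).*}

END_ANCHORED.match?('/blog')        # => false
END_ANCHORED.match?('/blog/test')   # => true, slips through
TOKEN_AWARE.match?('/blog')         # => false
TOKEN_AWARE.match?('/blog/test')    # => false
TOKEN_AWARE.match?('/blogger/test') # => true
```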

So here is a breakdown:

  • There is an alternative syntax for creating regexes, %r{}; use it if you would otherwise have to escape a lot of forward slashes
  • ^ - from the start
  • https?:// - http:// or https://
  • [^/]* - zero or more characters that are not forward slashes
  • (?<!blog\.) - negative lookbehind to ensure the host is not blog.example.com
  • example\.com - the example.com domain itself
  • /(?!news/|archives/|blog/) - after first slash, the "url token" is not news or archives or blog
  • (?!.*/blog(/|$)) - any of the further "url tokens" are not blog
  • .* - match the remaining characters
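The breakdown above can be checked against the four URLs from the question. This is only a quick sketch; the constant name `URL_PATTERN` and the `results` hash are assumptions made for the example.

```ruby
# The regex from this answer, stored under a hypothetical constant name.
URL_PATTERN = %r{^https?://[^/]*(?<!blog\.)example\.com/(?!news/|archives/|blog/)(?!.*/blog(/|$)).*}

urls = [
  'http://example.com/test',           # should match
  'http://blog.example.com/test',      # should not match
  'http://example.com/test/blog/test', # should not match
  'http://example.com/test/test2'      # should match
]

# Map each URL to true/false depending on whether the pattern matches.
results = urls.to_h { |url| [url, URL_PATTERN.match?(url)] }

results.each { |url, matched| puts "#{url} => #{matched}" }
# http://example.com/test => true
# http://blog.example.com/test => false
# http://example.com/test/blog/test => false
# http://example.com/test/test2 => true
```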

Upvotes: 2
