guillaume

Reputation: 1678

URL regex excluding a specific domain not matching correctly

I'm trying to match some strings with a regex, but it's not working. I want to match a string that does not start with http://www.domain.com. Here is my regex:

^https?:\/\/(www\.)?(?!domain\.com)

Is there a problem with my regex?

I want to match strings that start with http:// but do not point to http://site.com. For example:

/page.html => false
http://www.google.fr => true
http://site.com => false
http://site.com/page.html => false

Upvotes: 0

Views: 7972

Answers (3)

JonM

Reputation: 1374

The problem is that when the negative look-ahead fails (because domain\.com does match at that position), the regex engine backtracks into the preceding group (www\.), which is quantified as optional, and retries the expression without it. The look-ahead is then tested at the start of www.domain.com, where domain\.com does not match, so the overall match succeeds. This is what you have overlooked.

This could be fixed with atomic grouping or possessive quantifiers, which 'forget' the possibility of backtracking. Unfortunately, Python's re module does not support these natively before Python 3.11. Instead you'll have to use a less efficient method: a larger look-ahead that includes the optional prefix.

^https?:\/\/(?!(www\.)?(domain\.com))
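To make the backtracking visible, here is a minimal sketch that runs the pattern from the question and the pattern above side by side. The variable names `original` and `fixed` are only illustrative, and the slashes are left unescaped because Python does not require escaping them.

```python
import re

# Pattern from the question: the optional www. prefix sits outside the look-ahead.
original = re.compile(r'^https?://(www\.)?(?!domain\.com)')
# Pattern from this answer: the optional prefix is moved inside the look-ahead.
fixed = re.compile(r'^https?://(?!(www\.)?(domain\.com))')

urls = [
    'http://www.domain.com',
    'http://domain.com',
    'http://www.google.fr',
    '/page.html',
]

for url in urls:
    # bool() turns "match object or None" into True/False for readability.
    print(url, bool(original.search(url)), bool(fixed.search(url)))

# What the original pattern actually consumed on the excluded host:
# only the scheme, because the engine gave up the optional "www." group.
print(repr(original.search('http://www.domain.com').group(0)))
```

The last line shows that the original pattern matches only `http://` on the excluded host, which is the sign that the optional group was skipped during backtracking.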

Upvotes: 1

Martijn Pieters

Reputation: 1122342

You want a negative look-ahead assertion:

^https?://(?!(?:www\.)?site\.com).+

Which gives:

>>> import re
>>> testdata = '''\
... /page.html => false
... http://www.google.fr => true
... http://site.com => false
... http://site.com/page.html => false
... '''.splitlines()
>>> not_site_com = re.compile(r'^https?://(?!(?:www\.)?site\.com).+')
>>> for line in testdata:
...     match = not_site_com.search(line)
...     if match: print(match.group())
... 
http://www.google.fr => true

Note that the pattern excludes both www.site.com and site.com:

>>> not_site_com.search('https://www.site.com')
>>> not_site_com.search('https://site.com')
>>> not_site_com.search('https://site-different.com')
<_sre.SRE_Match object at 0x10a548510>
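One caveat worth checking: the look-ahead only tests a prefix, so a host that merely begins with site.com is rejected too. The sketch below compares the pattern above with a hypothetical stricter variant (`not_site_com_strict` is not part of the answer) that requires the excluded host to end at `/`, `:`, `?`, `#`, or the end of the string. Treat it as one possible refinement under that assumption, not a full URL parser.

```python
import re

# Pattern from the answer above.
not_site_com = re.compile(r'^https?://(?!(?:www\.)?site\.com).+')

# Hypothetical stricter variant: site.com must be followed by a URL
# delimiter or the end of the string to count as the excluded host.
not_site_com_strict = re.compile(
    r'^https?://(?!(?:www\.)?site\.com(?:[/:?#]|$)).+'
)

urls = [
    'http://site.com/page.html',   # excluded host, rejected by both
    'https://www.site.com',        # excluded host, rejected by both
    'http://site.community.org',   # different host sharing the prefix
]

for url in urls:
    print(url, bool(not_site_com.search(url)), bool(not_site_com_strict.search(url)))
```

Only the last URL differs between the two patterns: the answer's pattern rejects it, while the stricter variant accepts it.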

Upvotes: 0

Daedalus

Reputation: 1667

Use this to match a URL that is not on www.domain.com (note that it only excludes the www. form, so http://domain.com still matches):

https?://(?!(www\.domain\.com\/?)).*

Example in action: http://regexr.com?34a7p
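For a local check alongside the online tester, the pattern can be run in Python as in the rough sketch below. The sample URLs and the name `pattern` are my own choices for illustration.

```python
import re

# Pattern from this answer, copied as written.
pattern = re.compile(r'https?://(?!(www\.domain\.com\/?)).*')

samples = [
    'http://www.domain.com',
    'http://www.domain.com/page.html',
    'http://domain.com',
    'http://www.google.fr',
]

for url in samples:
    # search() returns None when the look-ahead rejects the host.
    print(url, bool(pattern.search(url)))
```

The two www.domain.com URLs are rejected, while the bare http://domain.com is accepted, since the look-ahead spells out the www. prefix literally.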

Upvotes: 7
