regex : avoid group - url domain name

Question

I wrote this regex for the re module which, as far as I can tell, works as expected:

^(https?://)([\w\.-]+)[\./]*(?(1)(domain-name.com))

Run against a list of URLs, it matches only the ones containing domain-name.com. But I don't understand why:

^(https?://)([\w\.-]+)[\./]*(?(1)(!(domain-name.com)))

does not return all the other URLs. In fact, it never matches anything.

Thank you

on pythex

ctwheels · Accepted Answer

Matching domain-name.com

To match domain-name.com domains, use the following.

See regex in use here

^https?://(?:\w+(?:-\w+)*\.)*domain-name\.com(?=$|/)

^ Assert position at the start of the line
https? Match http or https (s is optional)
:// Match this literally
(?:\w+(?:-\w+)*\.)* Match any number of subdomains. A subdomain cannot begin or end with -, so this subpattern works as follows:
- \w+ Match one or more word characters
- (?:-\w+)* Match the following any number of times
  - - Match this literally
  - \w+ Match one or more word characters
- \. Match the dot character literally
domain-name\.com Match domain-name.com literally
(?=$|/) Positive lookahead ensuring that either the end of the line or a / follows
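
As a rough illustration of the breakdown above, here is a minimal Python sketch that applies the pattern with the re module. The sample URLs and the names DOMAIN_RE and urls are made up for the example and do not come from the question.

```python
import re

# Pattern from the answer: optional subdomains, then domain-name.com,
# followed by either the end of the string or a slash.
DOMAIN_RE = re.compile(r"^https?://(?:\w+(?:-\w+)*\.)*domain-name\.com(?=$|/)")

# Hypothetical sample URLs (assumed for illustration only).
urls = [
    "http://domain-name.com",
    "https://www.domain-name.com/path",
    "https://domain-name.com.example.org",   # domain-name.com is only a prefix here
    "https://other-domain-name.com",         # a different registered domain
    "https://example.com/domain-name.com",   # appears in the path, not the host
]

# Keep only the URLs whose host is domain-name.com or one of its subdomains.
matching = [u for u in urls if DOMAIN_RE.search(u)]
print(matching)
```

Only the first two URLs should be kept: the lookahead rejects the host that merely starts with domain-name.com, and the subdomain group must end in a dot, so other-domain-name.com does not slip through.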

Matching non-domain-name.com

To match non-domain-name.com domains, use the following.

See regex in use here

^https?://(?:\w+(?:-\w+)*\.)*(?!domain-name\.com)[\w-]+\.[\w-]+(?=$|/)

This is the same as the first pattern, except that the literal domain-name\.com is replaced by (?!domain-name\.com)[\w-]+\.[\w-]+. The negative lookahead (?!...) rejects domain-name.com at that position, and [\w-]+\.[\w-]+ then matches any other domain.
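
A minimal Python sketch of the negative-lookahead pattern follows, again with made-up sample URLs. It also compiles the second pattern from the question to illustrate a likely reason it never matched: in the re module a bare ! inside a group is an ordinary character, so that pattern requires a literal "!domain-name.com" in the URL. The names OTHER_RE, QUESTION_RE and urls are assumptions for the example.

```python
import re

# Pattern from the answer: (?!...) is a negative lookahead that rejects
# domain-name.com where the registered domain would start.
OTHER_RE = re.compile(
    r"^https?://(?:\w+(?:-\w+)*\.)*(?!domain-name\.com)[\w-]+\.[\w-]+(?=$|/)"
)

# Second pattern from the question: here "!" is a literal exclamation mark,
# not a negation.
QUESTION_RE = re.compile(r"^(https?://)([\w\.-]+)[\./]*(?(1)(!(domain-name.com)))")

# Hypothetical sample URLs (assumed for illustration only).
urls = [
    "https://example.com/page",
    "https://www.domain-name.com",
    "http://domain-name.com/",
    "https://sub.example.org",
]

# URLs whose host is neither domain-name.com nor one of its subdomains.
others = [u for u in urls if OTHER_RE.search(u)]
print(others)

# The question's pattern matches none of the ordinary URLs...
print([u for u in urls if QUESTION_RE.search(u)])
# ...but it does match when a literal "!" precedes the domain name.
print(bool(QUESTION_RE.search("http://example.com/!domain-name.com")))
```

With these samples, others should contain only the example.com and sub.example.org URLs, while the question's pattern matches nothing in the list.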
