John Doe

Reputation: 1639

regex : avoid group - url domain name

I wrote this regex for Python's re module which, as far as I know, works as expected:

^(https?://)([\w\.-]+)[\./]*(?(1)(domain-name.com))

Run against a list of URLs, it matches only the ones containing domain-name.com. But I don't understand why:

^(https?://)([\w\.-]+)[\./]*(?(1)(!(domain-name.com)))

does not return all the other URLs. In fact, it never matches anything.

Thank you

on pythex

Upvotes: 0

Views: 96

Answers (2)

ctwheels

Reputation: 22837

Matching domain-name.com

To match domain-name.com domains, use the following.


^https?://(?:\w+(?:-\w+)*\.)*domain-name\.com(?=$|/)
  • ^ Assert position at the start of the line
  • https? Match http or https (s is optional)
  • :// Match this literally
  • (?:\w+(?:-\w+)*\.)* Match any number of subdomains. A subdomain cannot begin or end with -, so this subpattern does as follows:
    • \w+ Match one or more word characters
    • (?:-\w+)* Match the following any number of times
      • - Match this literally
      • \w+ Match one or more word characters
    • \. Match the dot character literally
  • domain-name\.com Matches domain-name.com literally
  • (?=$|/) Positive lookahead ensuring that either the end of the line or a / follows
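
As a quick check of the pattern above, here is a minimal sketch using Python's re module. The sample URLs are hypothetical and chosen only for illustration; the pattern itself is copied unchanged.

```python
import re

# Pattern from the answer, compiled with the standard re module.
MATCH_DOMAIN = re.compile(r"^https?://(?:\w+(?:-\w+)*\.)*domain-name\.com(?=$|/)")

# Hypothetical sample URLs, for illustration only.
urls = [
    "http://domain-name.com",
    "https://www.domain-name.com/path",
    "http://example.com/",
    "http://domain-name.com.example.org",
]

# Keep only the URLs whose host is domain-name.com or one of its subdomains.
matching = [u for u in urls if MATCH_DOMAIN.search(u)]
print(matching)
# ['http://domain-name.com', 'https://www.domain-name.com/path']
```

The last URL is rejected because the lookahead requires the end of the line or a / right after domain-name.com, and there a . follows instead.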

Matching non-domain-name.com

To match non-domain-name.com domains, use the following.


^https?://(?:\w+(?:-\w+)*\.)*(?!domain-name\.com)[\w-]+\.[\w-]+(?=$|/)

This is the same as the first pattern, except that domain-name\.com is replaced with (?!domain-name\.com)[\w-]+\.[\w-]+. The negative lookahead rejects domain-name.com at that position, and [\w-]+\.[\w-]+ then matches any other domain of the form name.tld.
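
A minimal sketch of using this second pattern to filter a list in Python follows. Again, the URLs are hypothetical samples and the pattern is copied unchanged.

```python
import re

# Pattern from the answer: same prefix, but a negative lookahead
# rejects domain-name.com where the registered domain is expected.
OTHER_DOMAINS = re.compile(
    r"^https?://(?:\w+(?:-\w+)*\.)*(?!domain-name\.com)[\w-]+\.[\w-]+(?=$|/)"
)

# Hypothetical sample URLs, for illustration only.
urls = [
    "http://domain-name.com",
    "https://www.domain-name.com/path",
    "http://example.com/",
    "https://sub.example.org/page",
]

# Keep only the URLs that are not on domain-name.com.
others = [u for u in urls if OTHER_DOMAINS.search(u)]
print(others)
# ['http://example.com/', 'https://sub.example.org/page']
```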

Upvotes: 1

Leyff da

Reputation: 226

You need to use a negative lookahead, written (?!...), instead of a bare !, which the regex engine treats as a literal character:

^(https?://)([\w\.-]+)[\./]*(?(1)(?!(domain-name.com)))
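
The difference between the two spellings can be checked with a short Python sketch. The test URLs are hypothetical, and both patterns are copied as written in this thread.

```python
import re

# Pattern from the question: "!" is an ordinary literal character here.
literal_bang = re.compile(r"^(https?://)([\w\.-]+)[\./]*(?(1)(!(domain-name.com)))")
# Pattern from this answer: "(?!...)" is a negative lookahead.
lookahead = re.compile(r"^(https?://)([\w\.-]+)[\./]*(?(1)(?!(domain-name.com)))")

# The literal version only matches when the URL really contains "!domain-name.com".
print(bool(literal_bang.search("http://example.com")))           # False
print(bool(literal_bang.search("http://www.!domain-name.com")))  # True

# The lookahead version matches ordinary URLs...
print(bool(lookahead.search("http://example.com")))              # True
# ...but also this one: the greedy [\w\.-]+ has already consumed the
# whole host, so the lookahead is tested at the end of the string.
print(bool(lookahead.search("http://www.domain-name.com")))      # True
```

So ?! fixes the "never matches anything" symptom, but in this quick check the pattern still accepts domain-name.com URLs. To exclude them, the lookahead has to be tested before the host is consumed.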

Upvotes: 0
