Lee Jack
Reputation: 191

Regular expression to match and extract a long domain

I want to match and extract a domain name. I have the following line of code:

result = re.findall(r"(^((?!-))(xn--)?[a-z0-9][a-z0-9-_]{0,61}[a-z0-9]{0,1}\.(xn--)?([a-z0-9\-]{1,61}|[a-z0-9-]{1,30}\.[a-z]{2,})$)", text)

It works for domains like example.org and example.org.eu, but it does not match domains with additional subdomain labels, such as sub_example.example.org.eu.

Upvotes: 1

Views: 106

Answers (1)

Wiktor Stribiżew

Reputation: 626802

After expanding and pruning your pattern, you can match the third type of string (as well as the first two) with

^(?:(?:xn--)?[a-z0-9][a-z0-9-_]{0,61}[a-z0-9]?\.)+(?:xn--)?(?:[a-z0-9-]{1,61}|[a-z0-9-]{1,30}\.[a-z]{2,})$

See the regex demo.

The main point is that I wrapped the (?:xn--)?[a-z0-9][a-z0-9-_]{0,61}[a-z0-9]?\. part with a non-capturing group and quantified it with + (one or more repetitions).

Note that you may use it with re.findall directly: I removed all capturing groups, so re.findall returns the whole matches and you do not need to wrap the pattern in parentheses.
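As a minimal sketch of that usage (the names DOMAIN_RE and extract_domain and the sample strings are only illustrative, not from the question), the pattern can be compiled once and passed to findall:

```python
import re

# Pattern from this answer, split over two lines only for readability.
# It contains no capturing groups, so findall returns whole matches.
DOMAIN_RE = re.compile(
    r"^(?:(?:xn--)?[a-z0-9][a-z0-9-_]{0,61}[a-z0-9]?\.)+"
    r"(?:xn--)?(?:[a-z0-9-]{1,61}|[a-z0-9-]{1,30}\.[a-z]{2,})$"
)

def extract_domain(text):
    """Return a list with the domain if the whole string is one, else []."""
    return DOMAIN_RE.findall(text)

print(extract_domain("sub_example.example.org.eu"))
# ['sub_example.example.org.eu']

# Assumption: if the input holds one candidate per line, re.MULTILINE
# lets ^ and $ anchor at line boundaries instead of the whole string.
lines = "example.org\nnot a domain\nsub_example.example.org.eu"
print(re.findall(DOMAIN_RE.pattern, lines, re.MULTILINE))
# ['example.org', 'sub_example.example.org.eu']
```

Without re.MULTILINE the anchors force the entire input to be a single domain, so findall returns at most one item.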

You do not need the first (?!-) as the next consuming pattern does not match a hyphen, so I removed it.
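A quick way to convince yourself that the lookahead is redundant is to compare both variants on a few inputs. This is only a sketch; TAIL, WITH_LOOKAHEAD, WITHOUT_LOOKAHEAD, agree and the sample strings are names I made up for the check:

```python
import re

# Everything after the start anchor, shared by both variants.
TAIL = (
    r"(?:(?:xn--)?[a-z0-9][a-z0-9-_]{0,61}[a-z0-9]?\.)+"
    r"(?:xn--)?(?:[a-z0-9-]{1,61}|[a-z0-9-]{1,30}\.[a-z]{2,})$"
)
WITH_LOOKAHEAD = re.compile(r"^(?!-)" + TAIL)
WITHOUT_LOOKAHEAD = re.compile(r"^" + TAIL)

def agree(s):
    """True if both variants accept or both reject the string."""
    return bool(WITH_LOOKAHEAD.search(s)) == bool(WITHOUT_LOOKAHEAD.search(s))

samples = ["example.org", "-example.org", "xn--example.org",
           "sub_example.example.org.eu"]
for s in samples:
    print(s, agree(s))  # prints True for every sample
```

A leading hyphen is rejected either way because the first consumed character must be x (from xn--) or match [a-z0-9].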

Details

  • ^ - start of string
  • (?:(?:xn--)?[a-z0-9][a-z0-9-_]{0,61}[a-z0-9]?\.)+ - 1 or more sequences of
    • (?:xn--)? - an optional xn-- substring
    • [a-z0-9] - a lowercase ASCII letter or digit
    • [a-z0-9-_]{0,61} - 0 to 61 lowercase ASCII letters, digits, - or _
    • [a-z0-9]? - an optional lowercase ASCII letter or digit
    • \. - a dot
  • (?:xn--)? - an optional xn-- string
  • (?:[a-z0-9-]{1,61}|[a-z0-9-]{1,30}\.[a-z]{2,}) - either of the two alternatives:
    • [a-z0-9-]{1,61} - 1 to 61 lowercase ASCII letters, - or digits
    • | - or
    • [a-z0-9-]{1,30}\.[a-z]{2,} - 1 to 30 lowercase ASCII letters, - or digits, a dot and two or more lowercase ASCII letters
  • $ - end of string.
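If you prefer to keep this breakdown next to the pattern in your code, the same regex can be written with re.VERBOSE. This is a sketch of the pattern above with the explanations as inline comments; the name DOMAIN_VERBOSE is just an illustration:

```python
import re

DOMAIN_VERBOSE = re.compile(r"""
    ^                               # start of string
    (?:                             # 1 or more dot-terminated labels:
        (?:xn--)?                   #   an optional xn-- substring
        [a-z0-9]                    #   a lowercase ASCII letter or digit
        [a-z0-9-_]{0,61}            #   0 to 61 letters, digits, - or _
        [a-z0-9]?                   #   an optional letter or digit
        \.                          #   a dot
    )+
    (?:xn--)?                       # an optional xn-- string
    (?:
        [a-z0-9-]{1,61}             # 1 to 61 letters, digits or -
      |                             # or
        [a-z0-9-]{1,30}\.[a-z]{2,}  # 1 to 30 of those, a dot, 2+ letters
    )
    $                               # end of string
""", re.VERBOSE)

print(bool(DOMAIN_VERBOSE.search("sub_example.example.org.eu")))  # True
print(bool(DOMAIN_VERBOSE.search("example")))                     # False
```

In verbose mode whitespace outside character classes is ignored, so the layout does not change what the pattern matches.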

Upvotes: 2