Reputation: 191
I want to match and extract a domain name. I have the following line of code:
result = re.findall(r"(^((?!-))(xn--)?[a-z0-9][a-z0-9-_]{0,61}[a-z0-9]{0,1}\.(xn--)?([a-z0-9\-]{1,61}|[a-z0-9-]{1,30}\.[a-z]{2,})$)", text)
It does work well for domains like example.org and example.org.eu. But it does not work for domains like sub_example.example.org.eu.
Upvotes: 1
Views: 106
Reputation: 626802
Expanding and pruning your pattern, the pattern you may use to match the third type of strings is
^(?:(?:xn--)?[a-z0-9][a-z0-9-_]{0,61}[a-z0-9]?\.)+(?:xn--)?(?:[a-z0-9-]{1,61}|[a-z0-9-]{1,30}\.[a-z]{2,})$
See the regex demo.
The main point is that I wrapped the (?:xn--)?[a-z0-9][a-z0-9-_]{0,61}[a-z0-9]?\. part with a non-capturing group and quantified it with + (one or more repetitions).
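To illustrate the effect of the + quantifier, here is a minimal sketch that compares the label part used once against the same part repeated. The names LABEL, TAIL, single and repeated are only for illustration, and single is a simplified stand-in for the pattern in the question (capturing groups and the lookahead left out), not an exact copy of it.

```python
import re

# The "label plus dot" part and the trailing part of the pattern.
LABEL = r"(?:xn--)?[a-z0-9][a-z0-9-_]{0,61}[a-z0-9]?\."
TAIL = r"(?:xn--)?(?:[a-z0-9-]{1,61}|[a-z0-9-]{1,30}\.[a-z]{2,})"

# Label part matched exactly once (simplified stand-in for the question's pattern).
single = re.compile(rf"^(?:{LABEL}){TAIL}$")
# Label part matched one or more times, as in the answer.
repeated = re.compile(rf"^(?:{LABEL})+{TAIL}$")

samples = ["example.org", "example.org.eu", "sub_example.example.org.eu"]
for name in samples:
    # Prints the name, then whether each variant matches it.
    print(name, bool(single.match(name)), bool(repeated.match(name)))
```

Only the repeated variant accepts sub_example.example.org.eu, because the single variant has room for at most two dots in total.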
Note that you may use it with re.findall directly: I removed all capturing groups, so you do not need to wrap the whole pattern in parentheses.
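The reason capturing groups matter here is that re.findall returns the captured groups instead of the whole match as soon as the pattern contains any. A small sketch of the difference follows. The with_groups variant is a hypothetical modification made up for this comparison, not a pattern from the question.

```python
import re

text = "sub_example.example.org.eu"

# No capturing groups: findall returns each whole match as a string.
no_groups = r"^(?:(?:xn--)?[a-z0-9][a-z0-9-_]{0,61}[a-z0-9]?\.)+(?:xn--)?(?:[a-z0-9-]{1,61}|[a-z0-9-]{1,30}\.[a-z]{2,})$"
print(re.findall(no_groups, text))

# Two capturing groups added (illustration only): findall now returns
# one tuple of group values per match instead of the matched string.
with_groups = r"^((?:(?:xn--)?[a-z0-9][a-z0-9-_]{0,61}[a-z0-9]?\.)+)(?:xn--)?([a-z0-9-]{1,61}|[a-z0-9-]{1,30}\.[a-z]{2,})$"
print(re.findall(with_groups, text))
```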
You do not need the first (?!-) lookahead: the next consuming pattern, an optional xn-- followed by [a-z0-9], cannot match a hyphen at that position anyway, so I removed it.
Details
^ - start of string
(?:(?:xn--)?[a-z0-9][a-z0-9-_]{0,61}[a-z0-9]?\.)+ - 1 or more sequences of
    (?:xn--)? - an optional xn-- substring
    [a-z0-9] - a lowercase ASCII letter or digit
    [a-z0-9-_]{0,61} - 0 to 61 lowercase ASCII letters, digits, - or _
    [a-z0-9]? - an optional lowercase ASCII letter or digit
    \. - a dot
(?:xn--)? - an optional xn-- string
(?:[a-z0-9-]{1,61}|[a-z0-9-]{1,30}\.[a-z]{2,}) - either of the two alternatives:
    [a-z0-9-]{1,61} - 1 to 61 lowercase ASCII letters, - or digits
    | - or
    [a-z0-9-]{1,30}\.[a-z]{2,} - 1 to 30 lowercase ASCII letters, - or digits, a dot and two or more lowercase ASCII letters
$ - end of string.
Upvotes: 2
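The breakdown above can also be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace and allows # comments inside the pattern. This is only a sketch of one possible layout. The name DOMAIN_RE and the sample strings are assumptions made for the example.

```python
import re

DOMAIN_RE = re.compile(
    r"""
    ^                               # start of string
    (?:                             # one or more 'label + dot' sequences
        (?:xn--)?                   #   optional xn-- prefix
        [a-z0-9]                    #   a lowercase ASCII letter or digit
        [a-z0-9-_]{0,61}            #   0 to 61 letters, digits, hyphens or underscores
        [a-z0-9]?                   #   optional letter or digit
        \.                          #   a dot
    )+
    (?:xn--)?                       # optional xn-- prefix
    (?:
        [a-z0-9-]{1,61}             # 1 to 61 letters, digits or hyphens
      |                             # or
        [a-z0-9-]{1,30}\.[a-z]{2,}  # 1 to 30 of those, a dot, 2 or more letters
    )
    $                               # end of string
    """,
    re.VERBOSE,
)

for name in ["example.org", "sub_example.example.org.eu", "-example.org"]:
    # Prints the name and whether the verbose pattern matches it.
    print(name, bool(DOMAIN_RE.match(name)))
```

The verbose form should behave the same as the one-line pattern, since the pattern contains no literal spaces or # characters that verbose mode would otherwise drop.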