Reputation: 133
I have a school assignment about regex, which I will explain first.
I have to write a regex for checking URLs, the conditions I have to check are:
Is the URL http(s) or ftp(s)?
Is the domain .nl or .edu?
There is at least a third-level domain, but if the domain starts with www. there has to be a fourth-level domain.
Here is the regex I currently have:
(https?|ftps?):\/\/(www\.)?[a-z]+\.[a-z]+\.(nl|edu)$
My URL is:
http://www.lib.hva.nl
The URL currently passes the regex, but when I remove .lib or .hva, for example, it still passes, and that should not happen. When there's a www. in the domain, the domain should have four levels. Could someone help me out with this issue?
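For anyone who wants to reproduce this, here is a minimal Python sketch that runs the pattern above against a few URLs. The helper name `passes` and the extra test URLs are only illustrative, and `re.search` is assumed as the way the pattern is applied.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(r"(https?|ftps?):\/\/(www\.)?[a-z]+\.[a-z]+\.(nl|edu)$")


def passes(url: str) -> bool:
    """Return True if the question's pattern matches the URL."""
    return PATTERN.search(url) is not None


for url in (
    "http://www.lib.hva.nl",  # four levels with www: should pass
    "http://www.lib.nl",      # only three levels with www: should fail
    "http://www.hva.nl",      # only three levels with www: should fail
    "http://hva.nl",          # only two levels: should fail
):
    print(url, passes(url))
```

The second and third URLs are accepted even though they should be rejected, which is the problem described above.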
Upvotes: 2
Views: 170
Reputation: 1
You can also use {n} for exactly n occurrences, which can sometimes be more readable. It also makes it easy to increase the number of subdomains.
(https?|ftps?):\/\/(www\.)?+([a-z]+\.){2}(nl|edu)$
Upvotes: 0
Reputation: 19315
This can be resolved by using a possessive quantifier (+) after (www\.)?:
(https?|ftps?):\/\/(www\.)?+[a-z]+\.[a-z]+\.(nl|edu)$
Explanation: the original regex
(https?|ftps?):\/\/(www\.)?[a-z]+\.[a-z]+\.(nl|edu)$
matches
http://www.lib.nl
because, after the first attempt fails, the engine backtracks into (www\.)? and lets it match nothing. Since [a-z]+\. also matches www., the match then succeeds. To stop the engine from backtracking into (www\.)?, a possessive quantifier can be used.
Other options are a negative lookahead or an atomic group (as in the regex101 link).
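If you want to try the same idea in Python, note that the `re` module only gained possessive quantifiers and atomic groups in version 3.11. The sketch below therefore swaps in a different technique: it emulates an atomic group with a lookahead plus a backreference, which also works on older versions. The names `ATOMIC` and `is_valid_atomic` are just illustrative, and this is one possible way to do it rather than the exact regex from this answer.

```python
import re

# Emulated atomic group: the lookahead captures the optional "www." into
# group 2, and \2 then consumes exactly that text. A lookahead is not
# re-entered on backtracking, so a matched "www." cannot be given back.
ATOMIC = re.compile(
    r"(https?|ftps?)://(?=((?:www\.)?))\2[a-z]+\.[a-z]+\.(nl|edu)$"
)


def is_valid_atomic(url: str) -> bool:
    """Return True if the URL satisfies the level rules from the question."""
    return ATOMIC.search(url) is not None


for url in (
    "http://www.lib.hva.nl",
    "http://lib.hva.nl",
    "http://www.lib.nl",
    "http://hva.nl",
):
    print(url, is_valid_atomic(url))
```

With this pattern, http://www.lib.nl is rejected because www. stays consumed, leaving only lib.nl for the two required labels plus the top-level domain.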
Upvotes: 10
Reputation: 11336
The issue is that [a-z]+ also matches www. To prevent this, use a negative lookahead assertion before your first instance of [a-z]+, like this:
(https?|ftps?):\/\/(www\.)?(?!www\.)[a-z]+\.[a-z]+\.(nl|edu)$
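As a quick check, here is a small Python sketch that applies the regex above to a few URLs. The helper name `is_valid_lookahead` and the sample URLs are made up for illustration, and `re.search` is assumed as the matching call.

```python
import re

# The regex from this answer: (?!www\.) forbids the first label from being
# "www", so the engine cannot skip the optional group and reuse "www" as a
# regular label.
LOOKAHEAD = re.compile(
    r"(https?|ftps?):\/\/(www\.)?(?!www\.)[a-z]+\.[a-z]+\.(nl|edu)$"
)


def is_valid_lookahead(url: str) -> bool:
    """Return True if the lookahead pattern matches the URL."""
    return LOOKAHEAD.search(url) is not None


for url in (
    "http://www.lib.hva.nl",
    "ftps://lib.hva.edu",
    "http://www.lib.nl",
    "http://hva.nl",
):
    print(url, is_valid_lookahead(url))
```

The first two URLs should be accepted and the last two rejected, since the lookahead makes the www. case require four levels.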
Upvotes: 2