Reputation: 59501
I'm trying to create a regex for validating URLs. I know there are many advanced ones out there, but I want to create my own for learning purposes.
So far I have a regex that works quite well, however I want to improve the validation for the TLD part of the URI because I feel it's not quite there yet.
Here's my regex (or find it on regexr):
/^[(http(s)?):\/\/(www\.)?a-zA-Z0-9@:._\+~#=]{2,256}\.[a-zA-Z]{2,6}\b([/#?]{0,1}([A-Za-z0-9-._~:?#[\]@!$&''()*+,;=]|(%[A-Fa-f0-9]{2}))*)$/
It works well for links such as foo.com
or http://foo.com
or foo.co.uk
The problem appears when you introduce subdomains or second-level domains such as co.uk
because the regex will accept foo.co.u
or foo.co.
.
I did try using the following to select the substring after the last .
:
/[(http(s)?):\/\/(www\.)?a-zA-Z0-9@:._\+~#=]{2,256}[^.]{2,}$/
but this prevents me from defining the path
rules of the URI.
How can I ensure that the substring after the last .
but before the first /
, ?
or #
is at least 2 characters long?
Upvotes: 0
Views: 62
Reputation: 8332
From what I can see, you're almost there. Made some modification and it seems to work.
^(http(s)?:\/\/)?(www\.)?[a-zA-Z0-9@:._\+~#=]{2,256}\.[a-zA-Z]{2,6}([/#?;]([A-Za-z0-9-._~:?#[\]@!$&''()*+,;=]|(%[A-Fa-f0-9]{2}))*)?$
Can be somewhat shortened by doing
^(http(s)?:\/\/)?(www\.)?[\w@:.\+~#=]{2,256}\.[a-zA-Z]{2,6}([/#?;]([-\w.~:?#[\]@!$&''()*+,;=]|(%[A-Fa-f0-9]{2}))*)?$
(basically just tweaked your regex)
The main difference is that the parameter part is optional, but if it is there it has to start with one of /#?;
. That part could probably be simplified as well.
Edit:
After some experimenting I think this one is about as simple it'll get:
^(http(?:s)?:\/\/)?([-.~\w]+\.[a-zA-Z]{2,6})(:\d+)?(\/[-.~\w]*)?([@/#?;].*)?$
It also captures the separate parts - scheme, host, port, path and query/params.
Upvotes: 1