Chris

Reputation: 59501

Find last occurrence of period with regex

I'm trying to create a regex for validating URLs. I know there are many advanced ones out there, but I want to create my own for learning purposes.

So far I have a regex that works quite well, however I want to improve the validation for the TLD part of the URI because I feel it's not quite there yet.

Here's my regex (or find it on regexr):

/^[(http(s)?):\/\/(www\.)?a-zA-Z0-9@:._\+~#=]{2,256}\.[a-zA-Z]{2,6}\b([/#?]{0,1}([A-Za-z0-9-._~:?#[\]@!$&'()*+,;=]|(%[A-Fa-f0-9]{2}))*)$/

It works well for links such as foo.com or http://foo.com or foo.co.uk

The problem appears when you introduce subdomains or second-level domains such as co.uk because the regex will accept foo.co.u or foo.co..

I did try using the following to select the substring after the last .:

/[(http(s)?):\/\/(www\.)?a-zA-Z0-9@:._\+~#=]{2,256}[^.]{2,}$/

but this prevents me from defining the path rules of the URI.

How can I ensure that the substring after the last . but before the first /, ? or # is at least 2 characters long?

Upvotes: 0

Views: 62

Answers (1)

SamWhan

Reputation: 8332

From what I can see, you're almost there. I made some modifications and it seems to work:

^(http(s)?:\/\/)?(www\.)?[a-zA-Z0-9@:._\+~#=]{2,256}\.[a-zA-Z]{2,6}([/#?;]([A-Za-z0-9-._~:?#[\]@!$&'()*+,;=]|(%[A-Fa-f0-9]{2}))*)?$

It can be shortened somewhat to:

^(http(s)?:\/\/)?(www\.)?[\w@:.\+~#=]{2,256}\.[a-zA-Z]{2,6}([/#?;]([-\w.~:?#[\]@!$&'()*+,;=]|(%[A-Fa-f0-9]{2}))*)?$

(basically just tweaked your regex)

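The main difference is that the parameter part is optional as a whole, but if it is there it has to start with one of /#?;. In your version only the separator was optional ({0,1}), so a trailing .u or . was absorbed by the path characters, which is why foo.co.u matched. I also moved the scheme and www. out of the bracket expression, where they were being read as individual characters rather than as groups. The parameter part could probably be simplified as well.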
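To see the effect on the cases from the question, here is a minimal sketch that runs the shortened pattern against a few inputs. The names `urlPattern`, `isUrl` and `samples` are only for illustration, and the sample with a path is an assumption added to exercise the optional part.

```javascript
// The shortened pattern from above, written as a JavaScript regex literal.
const urlPattern = /^(http(s)?:\/\/)?(www\.)?[\w@:.\+~#=]{2,256}\.[a-zA-Z]{2,6}([/#?;]([-\w.~:?#[\]@!$&'()*+,;=]|(%[A-Fa-f0-9]{2}))*)?$/;

// Hypothetical helper: true when the whole input matches the pattern.
function isUrl(input) {
  return urlPattern.test(input);
}

// Inputs mentioned in the question, plus one with a path and a query.
const samples = [
  "foo.com",
  "http://foo.com",
  "foo.co.uk",
  "foo.co.u",      // TLD too short
  "foo.co.",       // nothing after the last period
  "foo.com/a?b=1", // assumed example for the optional part
];

for (const s of samples) {
  console.log(s, isUrl(s));
}
```

The first three and the last should be accepted, while foo.co.u and foo.co. should be rejected, because nothing after the TLD can match unless it starts with one of /#?;.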

Edit:

After some experimenting, I think this one is about as simple as it'll get:

^(http(?:s)?:\/\/)?([-.~\w]+\.[a-zA-Z]{2,6})(:\d+)?(\/[-.~\w]*)?([@/#?;].*)?$

It also captures the separate parts - scheme, host, port, path and query/params.
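As a rough sketch of how those groups could be read out in JavaScript (the helper `splitUrl`, the property names and the example URL are assumptions made for illustration, not part of the regex itself):

```javascript
// The simplified pattern from above; groups 1-5 are scheme, host, port, path, rest.
const urlParts = /^(http(?:s)?:\/\/)?([-.~\w]+\.[a-zA-Z]{2,6})(:\d+)?(\/[-.~\w]*)?([@/#?;].*)?$/;

// Hypothetical helper: returns the captured parts, or null when the input does not match.
// Groups that did not take part in the match come back as undefined.
function splitUrl(input) {
  const match = urlParts.exec(input);
  if (match === null) {
    return null;
  }
  const [, scheme, host, port, path, rest] = match;
  return { scheme, host, port, path, rest };
}

console.log(splitUrl("https://www.example.com:8080/docs?x=1#top"));
console.log(splitUrl("foo.co.uk"));
console.log(splitUrl("foo.co.u")); // null: the TLD needs at least 2 letters
```

Note that the scheme group includes the trailing ://, the port group includes the leading colon, and the path group only covers the first path segment; anything after it ends up in the last group.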

Upvotes: 1
