Oleg Novosad

Reputation: 2421

How to match links without a top-level domain using regex?

I use the following regex (an updated version of the linkify regex) to match links while not matching emails.

(\s*|[^a-zA-Z0-9.\+_\/"\>\-]|^)(?:([a-zA-Z0-9\+_\-]+(?:\.[a-zA-Z0-9\+_\-]+)*@)?(http:\/\/|https:\/\/|ftp:\/\/|scp:\/\/){1}?((?:(?:[a-zA-Z0-9][a-zA-Z0-9_%\-_+]*\.)+))(?:[a-zA-Z]{2,})((?::\d{1,5}))?((?:[\/|\?](?:[\-a-zA-Z0-9_%#*&+=~!?,;:.\/]*)*)[\-\/a-zA-Z0-9_%#*&+=~]|\/?)?)([^a-zA-Z0-9\+_\/"\<\-]|$)

However, this regex does not match URLs whose host has no top-level domain, such as: https://someurl:3333/view/something

Can you please help me with this? Thanks!

Upvotes: 0

Views: 153

Answers (1)

Sam

Reputation: 20486

This should be the "least modified" version of your expression that also matches domains without a top-level domain:

(\s*|[^a-zA-Z0-9.\+_\/"\>\-]|^)(?:([a-zA-Z0-9\+_\-]+(?:\.[a-zA-Z0-9\+_\-]+)*@)?(http:\/\/|https:\/\/|ftp:\/\/|scp:\/\/){1}?((?:[a-zA-Z0-9][a-zA-Z0-9_%\-_+.]*)(?:\.[a-zA-Z]{2,})?)((?::\d{1,5}))?((?:[\/|\?](?:[\-a-zA-Z0-9_%#*&+=~!?,;:.\/]*)*)[\-\/a-zA-Z0-9_%#*&+=~]|\/?)?)([^a-zA-Z0-9\+_\/"\<\-]|$)
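A quick way to check the difference is to run both expressions against the URL from the question. The following is a minimal sketch in Python's `re` module (the question does not say which engine is used, so Python is an assumption here); the names `PREFIX`, `SUFFIX`, `DOMAIN_OLD`, `DOMAIN_NEW` and `domain_of` are made up for illustration. Counting opening parentheses, the domain is the fourth capturing group.

```python
import re

# Shared pieces copied from the two expressions; only the domain part differs.
PREFIX = (
    r'(\s*|[^a-zA-Z0-9.\+_\/"\>\-]|^)(?:([a-zA-Z0-9\+_\-]+(?:\.[a-zA-Z0-9\+_\-]+)*@)?'
    r'(http:\/\/|https:\/\/|ftp:\/\/|scp:\/\/){1}?'
)
SUFFIX = (
    r'((?::\d{1,5}))?'
    r'((?:[\/|\?](?:[\-a-zA-Z0-9_%#*&+=~!?,;:.\/]*)*)[\-\/a-zA-Z0-9_%#*&+=~]|\/?)?)'
    r'([^a-zA-Z0-9\+_\/"\<\-]|$)'
)
DOMAIN_OLD = r'((?:(?:[a-zA-Z0-9][a-zA-Z0-9_%\-_+]*\.)+))(?:[a-zA-Z]{2,})'  # TLD required
DOMAIN_NEW = r'((?:[a-zA-Z0-9][a-zA-Z0-9_%\-_+.]*)(?:\.[a-zA-Z]{2,})?)'     # TLD optional

original = re.compile(PREFIX + DOMAIN_OLD + SUFFIX)
modified = re.compile(PREFIX + DOMAIN_NEW + SUFFIX)


def domain_of(pattern, text):
    """Return capture group 4 (the domain) of the first match, or None."""
    m = pattern.search(text)
    return m.group(4) if m else None


url = "https://someurl:3333/view/something"
print(domain_of(original, url))   # None
print(domain_of(modified, url))   # someurl
print(domain_of(modified, "https://example.com/path"))  # example.com
print(domain_of(modified, "user@example.com"))          # None (no protocol, so emails are still skipped)
```

Note that in the original expression group 4 only holds the dot-terminated labels (for example `example.`), because the TLD sits in a separate non-capturing group; in the modified one it holds the whole host.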

The part that changed was capture group 4, the one that grabs the domain (groups 1 to 3 are the leading delimiter, the optional email prefix, and the protocol). It went from:

(
 (?:
  (?:
   [a-zA-Z0-9]
   [a-zA-Z0-9_%\-_+]*
   \.
  )+                  (?# this is how they repeated for optional subdomains)
 )
)
(?:
 [a-zA-Z]{2,}         (?# here is the mandatory TLD)
)

To this:

(
 (?:
  [a-zA-Z0-9]
  [a-zA-Z0-9_%\-_+.]* (?# the . is in the character class here for subdomains)
 )
 (?:
  \.
  [a-zA-Z]{2,}
 )?                   (?# this TLD is optional)
)
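To see the effect of this change in isolation, the two domain fragments can be tested on bare host names. This is only a sketch, again assuming Python's `re` module; `OLD_DOMAIN` and `NEW_DOMAIN` are illustrative names, and `re.VERBOSE` is used so the fragments can keep the spaced-out layout shown above.

```python
import re

# Domain fragment before the change: dot-terminated labels, then a mandatory TLD.
OLD_DOMAIN = re.compile(r'''
    (
     (?:
      (?:
       [a-zA-Z0-9]
       [a-zA-Z0-9_%\-_+]*
       \.
      )+                  (?# one or more dot-terminated labels)
     )
    )
    (?:
     [a-zA-Z]{2,}         (?# mandatory TLD)
    )
''', re.VERBOSE)

# Domain fragment after the change: dots allowed inside the class, TLD optional.
NEW_DOMAIN = re.compile(r'''
    (
     (?:
      [a-zA-Z0-9]
      [a-zA-Z0-9_%\-_+.]* (?# the dot lives in the character class)
     )
     (?:
      \.
      [a-zA-Z]{2,}
     )?                   (?# optional TLD)
    )
''', re.VERBOSE)

for host in ["someurl", "example.com", "sub.example.com"]:
    old = OLD_DOMAIN.fullmatch(host)
    new = NEW_DOMAIN.fullmatch(host)
    print(host, bool(old), bool(new))
# someurl False True
# example.com True True
# sub.example.com True True
```

One side effect worth knowing about: because `.` is now part of the character class, the new fragment is more permissive and also accepts hosts such as `example.` with a trailing dot. Whether that matters depends on the text you are linkifying.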


Upvotes: 1
