Vsevolod Fedorov

Reputation: 521

How can I match the root of a domain name without www. using regex?

I am trying to match the root of a domain name with a regular expression in JS. I have a problem when the path does not contain www.

For example, I tried to match against this string:

(http://web.archive.org/web/20080620033027/http://www.mrvc.indianrail.gov.in/overview.htm)

The regex I tried is shown below. I tested it on regex101.com.

/(?<=(\/\/(www\.)|\/\/)).+?(?=\/)/g

I expect an output array containing web.archive.org and mrvc.indianrail.gov.in, but I get web.archive.org and www.mrvc.indianrail.gov.in, with www. still present in the second match.

Upvotes: 6

Views: 752

Answers (2)

doctorgu

Reputation: 634

First, you have to understand how the regex engine matches.

An alternation (|) succeeds as soon as any one of its branches matches, and the engine tries start positions from left to right. For example, with the input 123 122 and the pattern (123|12), the 12 branch is able to match the start of both words.

In your lookbehind, the position right after // already satisfies the \/\/ branch, so the match starts there and .+? consumes www. itself. The engine never needs the \/\/(www\.) branch.

I think your intent was for the longer branch (\/\/www\.) to win whenever it applies, and for the shorter branch (\/\/) to be used only otherwise.
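To see this behaviour concretely, here is a minimal sketch that runs the pattern from the question against the question's string. The variable names are only illustrative, and it assumes a JS runtime that supports lookbehind.

```javascript
// Input string from the question.
const url = "(http://web.archive.org/web/20080620033027/http://www.mrvc.indianrail.gov.in/overview.htm)";

// Pattern from the question: the lookbehind accepts either "//www." or "//".
const original = /(?<=(\/\/(www\.)|\/\/)).+?(?=\/)/g;

// Start positions are tried left to right. The position right after "//"
// comes before the one after "//www.", and the "//" branch succeeds there,
// so the second match begins at "www".
const originalMatches = url.match(original);
console.log(originalMatches);
// → [ 'web.archive.org', 'www.mrvc.indianrail.gov.in' ]
```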

I suggest dropping the lookbehind and reading the first capture group ($1) instead, like this:

\/\/(?:www\.)?(.+?)\/

https://regex101.com/r/Ufxzeq/1
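In JS, one way to collect the first group from every match is matchAll. This is just a sketch; the names input, pattern and hosts are my own and not part of the question.

```javascript
// Input string from the question.
const input = "(http://web.archive.org/web/20080620033027/http://www.mrvc.indianrail.gov.in/overview.htm)";

// "//", then an optional non-capturing "www.", then the host in group 1,
// ending at the next "/".
const pattern = /\/\/(?:www\.)?(.+?)\//g;

// matchAll needs the g flag; m[1] is the first capture group of each match.
const hosts = Array.from(input.matchAll(pattern), (m) => m[1]);
console.log(hosts);
// → [ 'web.archive.org', 'mrvc.indianrail.gov.in' ]
```

Because the optional (?:www\.)? is greedy, it consumes www. whenever it is present, so the capture group starts after it.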

Upvotes: 0

Allan

Reputation: 12448

What about this regex:

(?<=https?:\/\/(?:www\.)?)(?!www\.).+?(?=\/)

It matches web.archive.org and mrvc.indianrail.gov.in, without the www.

demo: https://regex101.com/r/5ZqK7n/3/

Differences from your initial regex:

  • In the positive lookbehind, I added s? to support https: URLs (remove it if not necessary).
  • (?:www\.)? is optional, so www. can appear 0 or 1 time.
  • After the lookbehind, I added a negative lookahead (?!www\.) so that .+? cannot start at the initial www.
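For completeness, a short sketch of using this regex from JS with String.prototype.match. The variable names are illustrative, and it assumes an engine with lookbehind support (for example a recent Node.js or Chrome).

```javascript
// Input string from the question.
const text = "(http://web.archive.org/web/20080620033027/http://www.mrvc.indianrail.gov.in/overview.htm)";

// Lookbehind: "http://" or "https://", optionally followed by "www.".
// Negative lookahead: the match itself must not begin with "www.".
const re = /(?<=https?:\/\/(?:www\.)?)(?!www\.).+?(?=\/)/g;

// With the g flag, match() returns the full text of every match.
const domains = text.match(re);
console.log(domains);
// → [ 'web.archive.org', 'mrvc.indianrail.gov.in' ]
```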

Upvotes: 1
