MarkII

Reputation: 892

JavaScript Regex URL extract domain only

Currently I can extract the 'domain' from any URL with the following regex:

/^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n\?\=]+)/im

However, I'm also getting subdomains, which I want to avoid. For example, if I have sites:

I currently get:

For the last two, I would like to exclude the freds and josh subdomain portions and extract only the true domain, which would be just meatmarket.co.uk.

I did find another Stack Overflow answer that tries to solve this in PHP; unfortunately, I don't know PHP. Is this translatable to JS (I'm actually using Google Apps Script, FYI)?

  function topDomainFromURL($url) {
    $url_parts = parse_url($url);
    $domain_parts = explode('.', $url_parts['host']);
    if (strlen(end($domain_parts)) == 2 ) { 
      // ccTLD here, get last three parts
      $top_domain_parts = array_slice($domain_parts, -3);
    } else {
      $top_domain_parts = array_slice($domain_parts, -2);
    }
    $top_domain = implode('.', $top_domain_parts);
    return $top_domain;
  }

Upvotes: 12

Views: 24066

Answers (6)

Vrushabh Ranpariya

Reputation: 376

This solution works for me. I also use it to validate the input, since it does not match when the string doesn't look like a URL.

^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+\.+[^:\/?\n]+)

RegEX Demo

Thanks to @anubhava
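A minimal sketch of using the pattern above from JavaScript, both to extract the host and as a rough validity check. The helper name `hostOrNull` is made up for this example. Note that the capture group keeps subdomains other than www, so this alone does not reduce freds.meatmarket.co.uk to meatmarket.co.uk.

```javascript
// Pattern from this answer; the capture group requires at least one dot.
const DOMAIN_RE = /^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+\.+[^:\/?\n]+)/;

// Hypothetical helper: returns the captured host, or null when the
// input does not look like a URL (for example, it contains no dot).
function hostOrNull(input) {
  const match = DOMAIN_RE.exec(input);
  return match ? match[1] : null;
}

console.log(hostOrNull("https://www.google.com/search?q=x"));    // google.com
console.log(hostOrNull("freds.meatmarket.co.uk?someparameter")); // freds.meatmarket.co.uk
console.log(hostOrNull("not a url"));                            // null
```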

Upvotes: 0

Volomike

Reputation: 24916

This is what I've come up with. I don't know how to combine the two match rules into a single regexp, however. This routine won't properly process bad domains like example..com. It does, however, account for TLDs of the form .xx, .xx.xx, .xxx, or longer (4 or more characters). It works on bare domain names or entire URLs, and the URLs don't have to use the http or https protocol; it could be ftp, chrome, or others.

function getRootDomain(s){
  var sResult = ''
  try {
    sResult = s.match(/^(?:.*\:\/?\/)?(?<domain>[\w\-\.]*)/i).groups.domain
      .match(/(?<root>[\w\-]*(\.\w{3,}|\.\w{2}|\.\w{2}\.\w{2}))$/).groups.root;
  } catch(ignore) {}
  return sResult;
}

So basically, the first match strips out anything before a :// (or a single-slash :/), if one exists. Next, it matches word characters, plus the dash and period you'd see in domains, and labels the result in a named capture group called domain. Because a colon is not in that character set, the domain match stops before a port such as :8080. If given an empty string, the function just returns an empty string back.

From there, we do another pass on the result. Instead of matching left-to-right from a leading ^ symbol, we anchor on the ending $ symbol, working right-to-left, and allow only 4 conditions on the end: .xx.xx, .xx, .xxx, or more than .xxx (such as 4+ character TLDs), where x is a word character. Note the {3,}: that means 3 or more of something, which is why TLDs of 3 or more characters are handled too. In front of that ending, we allow a single label made of word characters and dashes (but no periods), so any subdomains further left are dropped.
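The two passes described above can be sketched as a variant that returns both intermediate values. The name `getDomainParts` and the returned object shape are assumptions for illustration; the two regular expressions are the ones from the function above.

```javascript
// Hypothetical variant of getRootDomain that returns both values.
function getDomainParts(s) {
  var parts = { domain: '', root: '' };
  try {
    // Pass 1: drop an optional scheme, then take the run of host characters
    // (word characters, dashes, periods), which stops at a port, path, or query.
    parts.domain = s.match(/^(?:.*\:\/?\/)?(?<domain>[\w\-\.]*)/i).groups.domain;
    // Pass 2: anchor at the end and keep one label plus the TLD-like suffix.
    parts.root = parts.domain
      .match(/(?<root>[\w\-]*(\.\w{3,}|\.\w{2}|\.\w{2}\.\w{2}))$/).groups.root;
  } catch (ignore) {} // no match: leave whatever was found so far
  return parts;
}

var p = getDomainParts("https://josh.meatmarket.co.uk:8080/a/b");
console.log(p.domain + " | " + p.root);
// josh.meatmarket.co.uk | meatmarket.co.uk
```

With an input that has no dot at all, such as localhost, the second pass finds nothing and root stays an empty string.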

EDIT: Since posting this answer, I learned how to combine the full domain and the root part into one single RegExp. However, I'll keep the above for reasons where you may want to get both values, although the function only returned the root (but with a quick edit, could have returned both full domain and root domain). So, if you just want the root alone, then you could use this solution:

function getRootDomain(s){
  var sResult = ''
  try {
    sResult = s.match(/^(?:.*?:\/\/)?.*?(?<root>[\w\-]*(?:\.\w{2,}|\.\w{2}\.\w{2}))(?:[\/?#:]|$)/).groups.root;
  } catch(ignore) {}
  return sResult;
}

Upvotes: 0

Kanan Farzali

Reputation: 1053

export const extractHostname = url => {
    let hostname;

    // find & remove protocol (http, ftp, etc.) and get hostname
    if (url.indexOf("://") > -1)
    {
        hostname = url.split('/')[2];
    }
    else
    {
        hostname = url.split('/')[0];
    }

    // find & remove port number
    hostname = hostname.split(':')[0];

    // find & remove "?"
    hostname = hostname.split('?')[0];

    return hostname;
};

export const extractRootDomain = url => {
    let domain = extractHostname(url),
        splitArr = domain.split('.'),
        arrLen = splitArr.length;

    // extracting the root domain here
    // if there is a subdomain
    if (arrLen > 2)
    {
        domain = splitArr[arrLen - 2] + '.' + splitArr[arrLen - 1];

        // check to see if it's using a two-part country-code suffix
        // where both parts are two letters long (e.g. ".co.uk", ".me.uk")
        if (splitArr[arrLen - 2].length === 2 && splitArr[arrLen - 1].length === 2)
        {
            // this is using a ccTLD, so keep three parts
            domain = splitArr[arrLen - 3] + '.' + domain;
        }
    }

    return domain;
};

Upvotes: 1

Oleg V. Volkov

Reputation: 22461

So, you need the first hostname label stripped from your result, unless there are only two parts already?

Just post-process the result of the first match with a regexp that matches that condition:

function domain_from_url(url) {
    var result
    var match
    if (match = url.match(/^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n\?\=]+)/im)) {
        result = match[1]
        if (match = result.match(/^[^\.]+\.(.+\..+)$/)) {
            result = match[1]
        }
    }
    return result
}

console.log(domain_from_url("www.google.com"))
console.log(domain_from_url("yahoo.com/something"))
console.log(domain_from_url("freds.meatmarket.co.uk?someparameter"))
console.log(domain_from_url("josh.meatmarket.co.uk/asldf/asdf"))

// google.com
// yahoo.com
// meatmarket.co.uk
// meatmarket.co.uk

Upvotes: 23

1111161171159459134

Reputation: 1215

Try replacing www with something more general:

/^(?:https?:\/\/)?(?:[^@\n]+@)?(?:[^.]+\.)?([^:\/\n\?\=]+)/im

EDIT: If you absolutely want to keep the www in your regex, you could try this one:

/^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?(?:[^.]+\.)?([^:\/\n\?\=]+)/im
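A quick check of how the two patterns above behave, as a sketch (the constant names and the helper `firstGroup` are made up for this example). Both remove a leading label unconditionally, so they give the wanted result for three-or-more-part hosts like freds.meatmarket.co.uk, but a bare two-part host such as yahoo.com comes back as just com. The second pattern strips www and then one more label.

```javascript
// First pattern from this answer: any single leading label is optional-stripped.
const STRIP_ANY = /^(?:https?:\/\/)?(?:[^@\n]+@)?(?:[^.]+\.)?([^:\/\n\?\=]+)/im;
// Second pattern: strips "www." and then one more leading label.
const STRIP_WWW_THEN_ANY = /^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?(?:[^.]+\.)?([^:\/\n\?\=]+)/im;

// Hypothetical helper: first capture group, or null when nothing matches.
function firstGroup(re, url) {
  const match = url.match(re);
  return match ? match[1] : null;
}

console.log(firstGroup(STRIP_ANY, "www.google.com"));                       // google.com
console.log(firstGroup(STRIP_ANY, "freds.meatmarket.co.uk?someparameter")); // meatmarket.co.uk
console.log(firstGroup(STRIP_ANY, "yahoo.com/something"));                  // com
console.log(firstGroup(STRIP_WWW_THEN_ANY, "www.google.com"));              // com
```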

Upvotes: 1

osanger

Reputation: 2362

Try this:

https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.([a-z]{2,6}){1}
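A short sketch of what this pattern returns when used from JavaScript (the constant name is made up for this example). It requires the http:// or https:// prefix, and the overall match is the scheme plus the whole host, subdomain included, so it would still need post-processing to get meatmarket.co.uk.

```javascript
const URL_PREFIX_RE = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.([a-z]{2,6}){1}/;

const m = "https://freds.meatmarket.co.uk/asldf".match(URL_PREFIX_RE);
console.log(m[0]); // https://freds.meatmarket.co.uk
console.log(m[2]); // uk

// Without a scheme there is no match at all.
console.log(URL_PREFIX_RE.test("freds.meatmarket.co.uk")); // false
```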

Upvotes: 1
