Reputation: 2549

JavaScript to remove whatever is after the tld and before the whitespace

I have a bunch of functions that filter a page down to the domains attached to email addresses. It all works, except that some of the domains come out like this:

EXAMPLE.COM
EXAMPLE.ORG.
EXAMPLE.ORG>.
EXAMPLE.COM"
EXAMPLE.COM".
EXAMPLE.COM).
EXAMPLE.COM(COMMENT)"
DEPT.EXAMPLE.COM
EXAMPLE.ORG
EXAMPLE.COM.

I want to figure out one last filter (regex or not) that will remove everything after the TLD. All of these items are in an array.

EDIT

The function I'm using:

function filterByDomain(array) {
    // Regex literal: in a string passed to RegExp, "\." and "\b" lose their backslashes.
    var regex = /([^.\n]+\.[a-z]{2,6}\b)/i;
    return array.filter(function(text){
        return regex.test(text);
    });
}

Upvotes: 0

Answers (2)

anubhava

Reputation: 785631

You can probably use this regex to match your TLD for each case:

/^[^.\n]+\.[a-z]{2,63}$/gim

RegEx Demo

Your validation function can be:

function filterByDomain(array) {
    // No g flag here: with g, test() keeps lastIndex between calls and can skip items.
    var regex = /^[^.\n]+\.[a-z]{2,63}$/im;
    return array.filter(function(text){
        return regex.test(text);
    });
}
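
One thing to watch when calling test() inside filter: a regex object that carries the g flag remembers lastIndex from one call to the next, so a successful match on one array item can make the following item fail. The snippet below is only a small illustration of that behaviour; the variable names and sample inputs are made up for the demo.

```javascript
var inputs = ["EXAMPLE.COM", "EXAMPLE.ORG"];

// With the g flag, the second test() starts at the lastIndex left by the first.
var globalRegex = /^[^.\n]+\.[a-z]{2,63}$/gi;
var withGlobal = inputs.filter(function (text) {
    return globalRegex.test(text);
});

// Without the g flag, every test() starts from the beginning of the string.
var plainRegex = /^[^.\n]+\.[a-z]{2,63}$/i;
var withoutGlobal = inputs.filter(function (text) {
    return plainRegex.test(text);
});

console.log(withGlobal.length, withoutGlobal.length); // 1 2
```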

PS: Do read this Q & A to see that a TLD can be up to 63 characters long.
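
The filter above keeps or drops whole entries. If the goal is to trim each entry down to its domain instead, one possible sketch is to map every item to its leading match. The helper name stripAfterTld and the pattern are assumptions for illustration, not part of the answer; the pattern accepts only letters, digits and hyphens in labels and an alphabetic TLD.

```javascript
// Hypothetical helper (not from the answer): keep only the leading domain.
function stripAfterTld(array) {
    // One or more dot-terminated labels, then an alphabetic TLD of 2 to 63 letters.
    var domain = /^(?:[a-z0-9-]+\.)+[a-z]{2,63}/i;
    return array
        .map(function (text) {
            var match = text.match(domain);
            return match ? match[0] : null;
        })
        .filter(function (text) {
            return text !== null; // drop entries with no recognizable domain
        });
}

var cleaned = stripAfterTld([
    "EXAMPLE.ORG>.",
    "EXAMPLE.COM(COMMENT)\"",
    "DEPT.EXAMPLE.COM",
    "EXAMPLE.COM."
]);
// cleaned: ["EXAMPLE.ORG", "EXAMPLE.COM", "DEPT.EXAMPLE.COM", "EXAMPLE.COM"]
```

Because the pattern is anchored only at the start, anything after the TLD (quotes, brackets, comments, a trailing dot) is simply left out of the match.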

Upvotes: 2

Jan Turoň

Reputation: 32922

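I'd match all leading [\w.] characters and drop the trailing dot, if any:

var match = url.match(/^[\w.]+/); // null when url does not start with a word character or dot
var result = match ? match[0] : "";
if (result.slice(-1) == ".") result = result.slice(0, -1);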

Note that \w should be replaced with something more sophisticated:

_ is part of the \w set but should not appear in the domain part of a URL
- is not part of \w but can appear in a URL, as long as it is not adjacent to . or -

To keep the regexp simple and the code readable, I'd do it this way:

replace _ with # in the URL (both # and _ can appear only after the TLD)
replace - with _ (since _ is part of \w)
after the regexp match, replace _ back with -

A URL like www.-example-.com would still pass; it can be detected by searching for [.-]{2,}
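
The swap steps above can be sketched as follows. This is only one reading of the approach: the function name extractHost and the choice to return null for rejected input are assumptions, not something the answer specifies.

```javascript
// Hypothetical helper illustrating the swap-and-match steps described above.
function extractHost(url) {
    // Turn "_" into "#" so it no longer counts as \w,
    // then turn "-" into "_" so hyphens are matched by \w.
    var swapped = url.replace(/_/g, "#").replace(/-/g, "_");
    var match = swapped.match(/^[\w.]+/);
    if (!match) return null;
    // Restore the hyphens and drop a single trailing dot, if any.
    var host = match[0].replace(/_/g, "-").replace(/\.$/, "");
    // Reject adjacent dots or hyphens such as "www.-example-.com".
    return /[.-]{2,}/.test(host) ? null : host;
}

console.log(extractHost("my-site.example.org>.")); // "my-site.example.org"
console.log(extractHost("www.-example-.com"));     // null
```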

Upvotes: 0
