Reputation: 2549
I have a bunch of functions that filter a page down to the domains that are attached to email addresses. It's all working great except for one small thing: some of the links are coming out like this:
EXAMPLE.COM
EXAMPLE.ORG.
EXAMPLE.ORG>.
EXAMPLE.COM"
EXAMPLE.COM".
EXAMPLE.COM).
EXAMPLE.COM(COMMENT)"
DEPT.EXAMPLE.COM
EXAMPLE.ORG
EXAMPLE.COM.
I want to figure out one last filter (regex or not) that will remove everything after the TLD. All of these items are in an array.
EDIT
The function I'm using:
function filterByDomain(array) {
var regex = new RegExp("([^.\n]+\.[a-z]{2,6}\b)", 'gi');
return array.filter(function(text){
return regex.test(text);
});
}
Upvotes: 0
Views: 637
Reputation: 785631
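You can probably use this regex to match your TLD for each case:
/^[^.\n]+\.[a-z]{2,63}$/im
Your validation function can be:
function filterByDomain(array) {
  // No `g` flag: a global regex keeps `lastIndex` between test() calls,
  // which would make the second of two consecutive matching entries fail.
  var regex = /^[^.\n]+\.[a-z]{2,63}$/im;
  return array.filter(function(text) {
    return regex.test(text);
  });
}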
PS: Do read this Q & A to see that up to 63 characters are allowed in TLD.
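The filter above drops entries that carry trailing characters rather than trimming them. If the goal is to keep every entry and cut off whatever follows the TLD, one possible sketch is to map each string to its leading domain-like prefix. The helper name `stripAfterTld` and the exact pattern are assumptions made for illustration, not part of the answer above, and the pattern cannot tell a real TLD from any other run of letters without a list of known TLDs.

```javascript
// Hypothetical helper: keep only the leading "label.label.tld" part of each entry.
function stripAfterTld(array) {
  // One or more dot-separated labels followed by an alphabetic TLD (2 to 63 letters).
  // No `g` flag, so the regex carries no state between calls.
  var domain = /^[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,63}/i;
  return array
    .map(function (text) {
      var m = text.match(domain);
      return m ? m[0] : null; // null when no domain-like prefix is found
    })
    .filter(function (d) { return d !== null; });
}

var cleaned = stripAfterTld([
  'EXAMPLE.ORG>.',
  'EXAMPLE.COM(COMMENT)"',
  'DEPT.EXAMPLE.COM',
  'EXAMPLE.COM.'
]);
console.log(cleaned);
```

Entries without a recognizable prefix are dropped here; returning them unchanged instead would be an equally reasonable choice.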
Upvotes: 2
Reputation: 32922
I'd match all leading [\w.] characters and omit the last dot, if any:
var result = url.match(/^[\w\.]+/).join("");
if(result.slice(-1)==".") result = result.slice(0,-1);
Note that \w should be replaced with something more sophisticated:
The character _ is part of the \w set but should not be in the URL path.
The character - is not part of \w but can be in a URL when it is not adjacent to . or -.
To keep the regexp simple and the code readable, I'd do it this way:
replace _ with # in the URL (both # and _ can appear only after the TLD);
replace - with _ (_ is part of \w);
replace _ back with - after matching.
A URL like www.-example-.com would still pass; this can be detected by searching for [.-]{2,}.
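The substitution steps and the [.-]{2,} check described above could be combined roughly as follows. This is only a sketch: the function name `extractHost` and the choice to return null for rejected input are assumptions, not something the answer specifies.

```javascript
// Hypothetical helper following the steps above; returns null for rejected input.
function extractHost(url) {
  var swapped = url
    .replace(/_/g, '#')   // "_" must not count as a word character
    .replace(/-/g, '_');  // let "-" be matched by \w
  var m = swapped.match(/^[\w.]+/);
  if (!m) return null;
  var host = m[0].replace(/_/g, '-'); // restore the hyphens
  // Reject runs such as ".-", "-." or ".." that the simple match lets through.
  if (/[.-]{2,}/.test(host)) return null;
  // Omit the last dot, if any.
  if (host.slice(-1) === '.') host = host.slice(0, -1);
  return host;
}

console.log(extractHost('EXAMPLE.ORG>.'));
console.log(extractHost('my-site.example.com".'));
console.log(extractHost('www.-example-.com'));
```

Swapping the characters before matching keeps the pattern at /^[\w.]+/ instead of a hand-written character class, which is the readability trade-off the answer argues for.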
Upvotes: 0