If a hyperlink is found in string, allow only specific domains

Question

I'm having an issue with some spam links at the moment. My users are allowed to post hyperlinks in an input box and I'd like to be able to restrict this to only certain domains (if a hyperlink is found in the text).

The spam has gotten to a point where I am disabling all hyperlinks using the following regex:

// "\\." keeps the dot escaped; inside a string literal a single "\." collapses to "."
if (new RegExp("([a-zA-Z0-9]+://)?([a-zA-Z0-9_]+:[a-zA-Z0-9_]+@)?([a-zA-Z0-9.-]+\\.[A-Za-z]{2,4})(:[0-9]+)?(/.*)?").test(contentString)) {
    alert("URLs are not allowed!");
    return false;
}

I want to ease this up a bit and only allow specific hyperlink domains.

I tried this which was found here:

function isAllowed(urlString)
{
    var allowed = ['example.com', 'stackoverflow.com', 'google.com'];
    var urlObject = new URL(urlString);

    return allowed.indexOf(urlObject.host) > -1;
}

console.log(isAllowed('http://example.com/path/?q=1')); // true
console.log(isAllowed('https://subdomain.example.com/')); // false
console.log(isAllowed('http://stacksnippets.net')); // false

if (!isAllowed(document.getElementById('yourTextbox').value))
{
    alert('Domain is not allowed!');
}

However, this only works if the whole string is itself a hyperlink, so I am a bit stumped on how to check links embedded in longer text.

derpirscher · Accepted Answer

Let's assume your regex works (I didn't look into it very deeply). Instead of using .test(...), which only tells you whether the string matches, you can use .exec(...) to get details about the match, in particular its capture groups. With the g (global) flag, repeated calls to .exec(...) return every match in the string, not only the first one. You should probably also add the i flag to make the regex case-insensitive.

var alloweddomains = ["stackoverflow.com", "google.com", ...];
var regex = new RegExp(..., "gi");
var teststring = "foo bar https://stackoverflow.com/questions/123456 baz blubb";
var m = regex.exec(teststring);

// while m !== null, the regex returned a match,
// so this iterates over all matches in teststring
while (m !== null) {
   // handle the current match, see explanation below
   ....

   // get the next match
   m = regex.exec(teststring);
}

regex.exec() returns null if the regex doesn't match (i.e. no further URL is found in the string), or an array containing the whole match followed by the capture groups. For this example it returns roughly the following array (with the original pattern, the first and last elements would also include the trailing " baz blubb", see the EDIT below):

["https://stackoverflow.com/questions/123456", 
 "https://", 
 undefined, 
 "stackoverflow.com", 
 undefined, 
 "/questions/123456"
]

The first element in the array is the whole match; the other elements are the capturing groups defined by (...) in your regex. For your use case m[3] is of particular interest, because it contains the domain of the URL. With that information you can easily check whether the domain is included in your list of allowed domains:

if (!alloweddomains.includes(m[3].toLowerCase())) {
  alert("domain is not allowed");
}

Use .toLowerCase() in the check as well: with the i flag the regex also matches "HTTPS://STACKOVERFLOW.COM", but .includes() is case-sensitive, so an uppercase domain would not be found in the alloweddomains array.
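Putting the loop and the check together could look like the sketch below. It assumes the pattern from the question, written as a regex literal so the escapes survive; the helper name findDisallowedDomains and the sample domains are made up for illustration.

```javascript
// The question's pattern as a regex literal.
// Groups: 1 = scheme, 2 = user:password@, 3 = host, 4 = port, 5 = path.
const URL_PATTERN = /([a-zA-Z0-9]+:\/\/)?([a-zA-Z0-9_]+:[a-zA-Z0-9_]+@)?([a-zA-Z0-9.-]+\.[A-Za-z]{2,4})(:[0-9]+)?(\/.*)?/;

// Hypothetical helper: returns every host found in `text`
// that is not listed in `allowedDomains`.
function findDisallowedDomains(text, allowedDomains) {
  // A fresh regex per call, because the g flag stores its position in lastIndex.
  const regex = new RegExp(URL_PATTERN.source, "gi");
  const disallowed = [];
  let m;
  while ((m = regex.exec(text)) !== null) {
    const host = m[3].toLowerCase(); // group 3 is the domain
    if (!allowedDomains.includes(host)) {
      disallowed.push(host);
    }
  }
  return disallowed;
}

const allowed = ["stackoverflow.com", "google.com"];
console.log(findDisallowedDomains("see HTTPS://STACKOVERFLOW.COM/questions/123456", allowed)); // []
console.log(findDisallowedDomains("buy now at http://spam.example.net", allowed)); // ["spam.example.net"]
```

An empty result means the text can be accepted; otherwise the returned hosts can be shown in the error message.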

EDIT

On a closer look, the very last part of your regex could be problematic:

(/.*)?

will match the entire rest of the string (up to the next line break) once the text before it looks like a URL. I'd suggest using something like

(/[^\s]*)?

here, so that the regex ends the match at the first whitespace. (If the pattern is passed to new RegExp as a string, write \\s so the backslash survives.)

Limiting the TLD to 2-4 characters also no longer seems correct. There are many longer TLDs, such as city names, which won't be matched correctly (the host is cut off after four letters).
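As a rough sketch of both adjustments, the pattern below stops the path at whitespace and accepts TLDs of two or more letters. It is still not a full URL grammar, and the sample text and the domain example.london are made up for illustration.

```javascript
// The question's pattern with two changes: {2,} instead of {2,4} for the TLD,
// and (\/[^\s]*)? instead of (\/.*)? so the path ends at the first whitespace.
const adjusted = /([a-zA-Z0-9]+:\/\/)?([a-zA-Z0-9_]+:[a-zA-Z0-9_]+@)?([a-zA-Z0-9.-]+\.[A-Za-z]{2,})(:[0-9]+)?(\/[^\s]*)?/gi;

const text = "docs at https://stackoverflow.com/questions/123456 and shop at http://example.london/sale today";

// matchAll requires the g flag and yields one match array per URL found.
const hosts = Array.from(text.matchAll(adjusted), (m) => m[3].toLowerCase());
console.log(hosts); // ["stackoverflow.com", "example.london"]
```

Because each match now ends at whitespace, the second URL on the same line is reported as its own match instead of being swallowed into the path of the first.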
