Reputation: 1605
I need to find a way to read content posted by users, find any hyperlinks it contains, wrap them in anchor tags, and add target and rel=nofollow attributes to all of those links.
I have come across some REGEX solutions like this:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
But other questions on SO about the same problem strongly recommend NOT using regex and using PHP's DOMDocument instead.
Whichever way is best, I need to add attributes like those mentioned above in order to harden all external links on the website.
Upvotes: 2
Views: 2641
Reputation: 14811
First of all, the guidelines you mentioned advise against parsing HTML with regexes. As far as I understand, what you are trying to do is parse plain text from the user and convert it into HTML. For that purpose, regexes are usually just fine.
(Note that I assume you parse the text into links yourself and aren't using an external library for that. In the latter case you'd need to fix the HTML the library outputs, and for this you should use DOMDocument to iterate over all <a> tags and add the proper attributes to them.)
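For the library-output case mentioned in the note above, a minimal sketch of the DOMDocument approach could look like the following. The function name hardenAnchors, the wrapper div, and the target="_blank" value are assumptions made for illustration, not something the original post specifies.

```php
<?php
// Hypothetical helper: adds target and rel attributes to every <a> in an HTML fragment.
function hardenAnchors(string $html): string
{
    $doc = new DOMDocument();

    // User-supplied markup is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration is a common hint so that the input is read as UTF-8;
    // the wrapper div gives us a known element to serialize from afterwards.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $anchor->setAttribute('target', '_blank'); // assumed value
        $anchor->setAttribute('rel', 'nofollow');
    }

    // Serialize only the children of the wrapper div (the first div in the document).
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo hardenAnchors('See <a href="http://example.com">this</a> page.') . PHP_EOL;
```

This sets the attributes on every anchor; restricting it to external links would need an extra check on each href, which is left out here.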
Now, you can parse it in two ways: server side, or client side.
Server side:
Pros:
Cons:
You need the rel="nofollow" attribute so that bots do not follow the links.
Client side:
Pros:
You don't need the rel="nofollow" attribute for the bots, since they don't see the links in the first place: they're generated with JavaScript, and bots usually don't parse JavaScript.
Cons:
I'll focus on implementing it server-side.
So, in order to parse links from user input and add any attributes you want to them, you can use something like this:
<?php
// Wraps every URL found in plain text in an <a> tag carrying the extra attributes.
function replaceLinks($text)
{
    $regex = '/'
        . '(?<!\S)'
        . '(((ftp|https?)?:?)\/\/|www\.)'
        . '(\S+?)'
        . '(?=$|\s|[,]|\.\W|\.$)'
        . '/m';

    return preg_replace_callback($regex, function ($match)
    {
        return '<a'
            . ' target=""' // fill in the target you need, e.g. _blank
            . ' rel="nofollow"'
            . ' href="' . $match[0] . '">'
            . $match[0]
            . '</a>';
    }, $text);
}
Explanation:
(?<!\S): not preceded by non-whitespace characters.
(((ftp|https?)?:?)\/\/|www\.): accept ftp://, http://, https://, ://, // and www. as the beginning of URLs.
(\S+?): match everything that is not whitespace, in non-greedy fashion.
(?=$|\s|[,]|\.\W|\.$): every URL must be followed by either end of line, whitespace, a comma, a dot followed by a character other than \w (this is to allow .com, .co.jp etc. to match), or a dot followed by end of line.
m flag: match multiline text.
Now, to support my claim that it works, I added a few test cases:
$tests = [];
$tests []= ['http://example.com', '<a target="" rel="nofollow" href="http://example.com">http://example.com</a>'];
$tests []= ['https://example.com', '<a target="" rel="nofollow" href="https://example.com">https://example.com</a>'];
$tests []= ['ftp://example.com', '<a target="" rel="nofollow" href="ftp://example.com">ftp://example.com</a>'];
$tests []= ['://example.com', '<a target="" rel="nofollow" href="://example.com">://example.com</a>'];
$tests []= ['//example.com', '<a target="" rel="nofollow" href="//example.com">//example.com</a>'];
$tests []= ['www.example.com', '<a target="" rel="nofollow" href="www.example.com">www.example.com</a>'];
$tests []= ['[email protected]', '[email protected]'];
$tests []= ['testhttp://example.com', 'testhttp://example.com'];
$tests []= ['example.com', 'example.com'];
$tests []= [
    'test http://example.com',
    'test <a target="" rel="nofollow" href="http://example.com">http://example.com</a>'];
$tests []= [
    'multiline' . PHP_EOL . 'blah http://example.com' . PHP_EOL . 'test',
    'multiline' . PHP_EOL . 'blah <a target="" rel="nofollow" href="http://example.com">http://example.com</a>' . PHP_EOL . 'test'];
$tests []= [
    'text //example.com/slashes.php?parameters#fragment, some other text',
    'text <a target="" rel="nofollow" href="//example.com/slashes.php?parameters#fragment">//example.com/slashes.php?parameters#fragment</a>, some other text'];
$tests []= [
    'text //example.com. new sentence',
    'text <a target="" rel="nofollow" href="//example.com">//example.com</a>. new sentence'];
Each test case is composed of two parts: the source input and the expected output. I used the following code to determine whether the function passes the tests above:
foreach ($tests as $test)
{
    list($source, $expected) = $test;
    $actual = replaceLinks($source);
    // Strict comparison: both sides are strings, so no type juggling is wanted.
    if ($actual !== $expected)
    {
        echo 'Test ' . $source . ' failed.' . PHP_EOL;
        echo 'Expected: ' . $expected . PHP_EOL;
        echo 'Actual: ' . $actual . PHP_EOL;
        die;
    }
}
echo 'All tests passed' . PHP_EOL;
I think this gives you an idea of how to solve the problem. Feel free to add more tests and experiment with the regex itself to make it suitable for your specific needs.
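One thing replaceLinks leaves open is that the user's text goes into the HTML unescaped, and a www. match becomes a relative href. A possible variant, sketched here under assumptions, escapes the text first and then links it. The name replaceLinksEscaped, the target="_blank" value, and the http:// prefix for www. matches are illustrative choices, not part of the function above, so its output differs from the expected strings in the tests.

```php
<?php
// Hypothetical variant of replaceLinks(): escapes the text before linking it.
function replaceLinksEscaped(string $text): string
{
    // After escaping, the text contains no raw <, >, " or ', so matches
    // can be placed inside href="..." without breaking out of the attribute.
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // Same pattern as in replaceLinks().
    $regex = '/(?<!\S)(((ftp|https?)?:?)\/\/|www\.)(\S+?)(?=$|\s|[,]|\.\W|\.$)/m';

    $result = preg_replace_callback($regex, function (array $match): string {
        // Assumption: a bare www. match should point to http://.
        $href = strpos($match[0], 'www.') === 0 ? 'http://' . $match[0] : $match[0];
        return '<a target="_blank" rel="nofollow" href="' . $href . '">'
            . $match[0]
            . '</a>';
    }, $escaped);

    // preg_replace_callback() returns null on a regex error; fall back to the escaped text.
    return $result ?? $escaped;
}

echo replaceLinksEscaped('<b>bold</b> www.example.com/?a=1&b=2') . PHP_EOL;
```

Because the whole input is escaped, any markup the user typed is shown as text rather than rendered, which may or may not be what a given site wants.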
Upvotes: 2
Reputation: 2869
Get the content to post using jQuery and process it before posting it to PHP.
// Link every www.* URL in the field before the form is posted.
$('#idof_content').val(
    $('#idof_content').val().replace(/\b(http(s|):\/\/|)(www\.\S+)/ig,
        "<a href='http$2://$3' target='_blank' rel='nofollow'>$3</a>"));
Upvotes: 0