Reputation: 2079
I have the following code to turn URLs in a message into HTML links:
$message = preg_replace("#(http|https|ftp|ftps)://([.]?[&;%=a-zA-Z0-9_/?-])*#",
"<a href=\"away?to=\\0\" target=\"_blank\">\\0</a>", $message);
$message = preg_replace("#(^| |\n)(www([.]?[&;%=a-zA-Z0-9_/?-])*)#",
"\\1<a href=\"away?to=http://\\2\" target=\"_blank\">\\2</a>", $message);
It works well with almost all links, except in the following cases:
1) http://example.com/mediathek#/video/1976914/zoom:-World-Wide
The problem here is the # and the : within the link: only part of the link is transformed.
2) If someone just writes "www" in a message
Example: <a href="http://www">www</a>
So the question is: is there any way to fix these two cases in the code above?
Upvotes: 0
Views: 183
Reputation: 89639
In my opinion, it is futile to try to solve this with a regex alone. A good alternative is to find what could be a URL via regex (something that begins with a protocol: http, ftp, mail... or with www) and then test it with FILTER_VALIDATE_URL. Keep in mind that this filter is not watertight, as the PHP manual says:
"Note that the function will only find ASCII URLs to be valid; internationalized domain names (containing non-ASCII characters) will fail."
Example of code (not tested):
$message = preg_replace_callback(
    '~(?(DEFINE)
        (?<prot> (?>ht|f) tps?+ :// )  # you can add protocols here
    )
    (?>
        <a\b (?> [^<]++ | < (?!/a>) )++ </a>  # avoid links inside "a" tags
      |
        <[^>]++>                              # and tag attributes
    ) (*SKIP)(?!)                             # make this branch fail
  |                                           # OR
    \b (?> (?<scheme> \g<prot> ) | www\. ) \S++  # something that begins with
                                              # "http://" or "www."
    ~xi',
    function ($match) {
        // Named group: the DEFINE group also takes a number, so numeric
        // indexes are easy to get wrong here.
        $url = (empty($match['scheme']) ? 'http://' : '') . $match[0];
        // Validate the complete URL, scheme included.
        if (filter_var($url, FILTER_VALIDATE_URL)) {
            return '<a href="away?to=' . $url . '" target="_blank">'
                 . $url . '</a>';
        }
        return $match[0];
    },
    $message
);
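To make the validation step easier to follow in isolation, here is a minimal sketch of it. The helper name `toValidUrl` is hypothetical and not part of the answer's code; it assumes candidates starting with "www." should get an `http://` prefix before being checked.

```php
<?php
// Hypothetical helper: prepend a scheme to "www." candidates, then let
// filter_var() decide whether the result is a usable URL.
function toValidUrl(string $candidate): ?string
{
    $url = (stripos($candidate, 'www.') === 0 ? 'http://' : '') . $candidate;

    // filter_var() returns the value on success and false on failure.
    return filter_var($url, FILTER_VALIDATE_URL) !== false ? $url : null;
}

// A "www." candidate gets a scheme and passes validation.
var_dump(toValidUrl('www.example.com'));

// The URL from the question, including "#" and ":", is accepted as well.
var_dump(toValidUrl('http://example.com/mediathek#/video/1976914/zoom:-World-Wide'));

// A scheme without a host is rejected.
var_dump(toValidUrl('http://'));
```

With this split, the regex only has to be loose enough to find candidates, and the filter does the strict part.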
Upvotes: 1
Reputation: 10168
Since you want to include the hash (#) in the regex, you need to either escape it or change the delimiter to a character that does not occur in your regex, e.g. !. So, your regex should look like this:
$message = preg_replace("!(http|https|ftp|ftps)://([.]?[&;%#:=a-zA-Z0-9_/?-])*!",
"<a href=\"away?to=\\0\" target=\"_blank\">\\0</a>", $message);
Does this help?
However, if you would like to follow the specification (RFC 1738) more closely, you might want to exclude %, which is not allowed in URLs except as the start of a percent-encoded escape. There are also some more allowed characters which you didn't include:
If you include these characters, you should then delimit your regex with %.
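The delimiter clash is easy to reproduce on a tiny pattern. The following is only a sketch with a made-up input string (`a#b`), not the asker's regex; the `@` operator is used here purely to silence the warning so the return value can be inspected.

```php
<?php
$text = 'a#b';

// "#" is the delimiter AND appears unescaped in the pattern, so PHP ends
// the pattern at the second "#" and reads "]b#" as modifiers.
// preg_replace() raises a warning and returns null.
$broken = @preg_replace('#a[#]b#', 'X', $text);

// Option 1: switch to a delimiter that does not occur in the pattern.
$switched = preg_replace('!a[#]b!', 'X', $text);

// Option 2: keep "#" as the delimiter and escape it inside the pattern.
$escaped = preg_replace('#a[\#]b#', 'X', $text);

var_dump($broken, $switched, $escaped);
```

Either option works; switching the delimiter keeps the character class slightly easier to read.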
Upvotes: 2
Reputation: 44851
A couple of minor tweaks: add \# and : to the first regex, then change the * to + in the second regex:
$message = preg_replace("#(http|https|ftp|ftps)://([.]?[&;%=a-zA-Z0-9_/?\#:-])*#",
"<a href=\"away?to=\\0\" target=\"_blank\">\\0</a>", $message);
$message = preg_replace("#(^| |\n)(www([.]?[&;%=a-zA-Z0-9_/?-])+)#",
"\\1<a href=\"away?to=http://\\2\" target=\"_blank\">\\2</a>", $message);
Upvotes: 1
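As a quick check of the two tweaked patterns from the last answer, they can be wrapped in a function and run against both problem cases from the question. The function name `linkify` is hypothetical; the patterns are the answer's, written in single-quoted strings.

```php
<?php
// Hypothetical wrapper around the two tweaked patterns.
function linkify(string $message): string
{
    // Scheme-prefixed URLs; "\#" and ":" are now part of the character class.
    $message = preg_replace(
        '#(http|https|ftp|ftps)://([.]?[&;%=a-zA-Z0-9_/?\#:-])*#',
        '<a href="away?to=\0" target="_blank">\0</a>',
        $message
    );

    // Bare "www" links; "+" requires at least one character after "www".
    return preg_replace(
        '#(^| |\n)(www([.]?[&;%=a-zA-Z0-9_/?-])+)#',
        '\1<a href="away?to=http://\2" target="_blank">\2</a>',
        $message
    );
}

// Case 1: the whole URL, including "#" and ":", ends up inside the link.
echo linkify('http://example.com/mediathek#/video/1976914/zoom:-World-Wide'), "\n";

// Case 2: a lone "www" is left untouched.
echo linkify('just www here'), "\n";
```

Note that the second pattern still accepts anything that merely starts with "www" followed by one allowed character, so it remains a heuristic rather than a validator.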