lickmycode

Reputation: 2079

How to fix this preg_replace code?

I have the following code to turn URLs in a message into HTML links:

$message = preg_replace("#(http|https|ftp|ftps)://([.]?[&;%=a-zA-Z0-9_/?-])*#",
    "<a href=\"away?to=\\0\" target=\"_blank\">\\0</a>", $message);

$message = preg_replace("#(^| |\n)(www([.]?[&;%=a-zA-Z0-9_/?-])*)#",
    "\\1<a href=\"away?to=http://\\2\" target=\"_blank\">\\2</a>", $message);

It works very well with almost all links, except in the following cases:

1) http://example.com/mediathek#/video/1976914/zoom:-World-Wide

The problem here is the # and the : within the link: only the part of the link before the # is transformed.

2) If someone just writes "www" in a message

Example: <a href="http://www">www</a>

Is there any way to fix these two cases in the code above?

Upvotes: 0

Views: 183

Answers (3)

Casimir et Hippolyte

Reputation: 89639

In my opinion, it is futile to tackle this problem with a regex alone. A good alternative is to find what could be a URL via regex (something that begins with a protocol: http, ftp, mail... or with www) and then test it with FILTER_VALIDATE_URL. Keep in mind that this filter is not watertight either, as the PHP manual says:

"Note that the function will only find ASCII URLs to be valid; internationalized domain names (containing non-ASCII characters) will fail."

Example of code (not tested):

$message = preg_replace_callback(
    '~(?(DEFINE)
          (?<prot> (?>ht|f) tps?+ :// )         # you can add protocols here
      )
      (?>
          <a\b (?> [^<]++ | < (?!/a>) )++ </a>  # avoid links inside "a" tags
        |
          <[^>]++>                              # and tag attributes.
      ) (*SKIP)(?!)                             # forces this branch to fail.
      |                                         # OR
      \b(?>(?<scheme>\g<prot>)|www\.)\S++       # something that begins with
                                                # "http://" or "www."
     ~xi',
    function ($match) {
        // The named group is used because (?<prot>...) already occupies group 1.
        // Prepend a scheme when the match starts with "www.".
        $url = empty($match['scheme']) ? 'http://' . $match[0] : $match[0];
        // Validate the complete URL, scheme included.
        if (filter_var($url, FILTER_VALIDATE_URL)) {
            return '<a href="away?to=' . $url . '" target="_blank">'
                 . $url . '</a>';
        }
        return $match[0];
    },
    $message);
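To look at the validation step in isolation, here is a minimal sketch. The helper name normalizeCandidate is hypothetical, and treating scheme-less candidates as http is an assumption carried over from the callback above, not something the PHP manual prescribes.

```php
<?php
// Hypothetical helper: returns an absolute URL if the candidate validates, else null.
function normalizeCandidate(string $candidate): ?string
{
    // Assumption: candidates without a scheme (e.g. "www.example.com") are treated as http.
    $url = preg_match('~^[a-z][a-z0-9+.-]*://~i', $candidate)
        ? $candidate
        : 'http://' . $candidate;

    // filter_var() returns the filtered string on success and false on failure.
    return filter_var($url, FILTER_VALIDATE_URL) !== false ? $url : null;
}

var_dump(normalizeCandidate('www.example.com')); // scheme is added before validating
var_dump(normalizeCandidate('http://'));         // a scheme without a host is rejected
```

The point of the sketch is that FILTER_VALIDATE_URL needs the scheme to be present, so the candidate has to be completed before it is tested.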

Upvotes: 1

matewka

Reputation: 10168

Since you want to include the hash (#) in the regex, the simplest fix is to change the delimiter to a character that does not appear in your pattern, e.g. ! (the alternative is to escape every # in the pattern). So, your regex should look like this:

$message = preg_replace("!(http|https|ftp|ftps)://([.]?[&;%#:=a-zA-Z0-9_/?-])*!",
"<a href=\"away?to=\\0\" target=\"_blank\">\\0</a>", $message);

Does this help?
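As a quick check, the following sketch runs the !-delimited pattern against the URL from the question. The variable names are only illustrative.

```php
<?php
// URL taken from the question.
$message = 'http://example.com/mediathek#/video/1976914/zoom:-World-Wide';

// "!" is the delimiter, so "#" can appear unescaped inside the character class.
$pattern = '!(http|https|ftp|ftps)://([.]?[&;%#:=a-zA-Z0-9_/?-])*!';

preg_match($pattern, $message, $m);

// True when the whole URL, including the "#" and ":" parts, was matched.
var_dump($m[0] === $message);
```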

Though, if you would like to follow the specification (RFC 1738) more closely, you might want to exclude %, which is only allowed in URLs as part of a percent-encoded escape (such as %20). There are also some more allowed characters, most of which you didn't include:

  • $
  • _
  • . (dot)
  • +
  • !
  • *
  • '
  • (
  • )

If you include these characters, you should then delimit your regex with %, because ! would now be part of the pattern.
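A sketch of what that could look like, assuming % is dropped from the character class as suggested above. The sample message is made up for illustration.

```php
<?php
// "%" is the delimiter and is deliberately absent from the character class.
// The class adds the extra characters listed above: $ _ . + ! * ' ( )
$pattern = '%(?:https?|ftps?)://[&;#:=a-zA-Z0-9_/?$.+!*\'()-]+%';

// Hypothetical sample text containing parentheses and a plus sign in the URL.
$message = 'see http://example.com/wiki/Foo_(bar)?x=1+2 today';

preg_match($pattern, $message, $m);
var_dump($m[0]);
```

One trade-off to keep in mind: with ., ! and ) in the class, punctuation that directly follows a URL in a sentence becomes part of the match.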

Upvotes: 2

elixenide

Reputation: 44851

A couple of minor tweaks: add \# and : to the character class of the first regex, then change the * to + in the second regex so that at least one character must follow "www":

$message = preg_replace("#(http|https|ftp|ftps)://([.]?[&;%=a-zA-Z0-9_/?\#:-])*#",
    "<a href=\"away?to=\\0\" target=\"_blank\">\\0</a>", $message);

$message = preg_replace("#(^| |\n)(www([.]?[&;%=a-zA-Z0-9_/?-])+)#",
    "\\1<a href=\"away?to=http://\\2\" target=\"_blank\">\\2</a>", $message);
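To check both cases from the question, the two calls can be wrapped in a small function. This is only a sketch: the name linkify is hypothetical, and the replacement strings use $0/$1/$2 in single quotes, which is equivalent to the \\0 form above.

```php
<?php
// Hypothetical wrapper around the two preg_replace() calls.
function linkify(string $message): string
{
    // Scheme-prefixed URLs; "#" is escaped because it is also the delimiter.
    $message = preg_replace(
        '#(http|https|ftp|ftps)://([.]?[&;%=a-zA-Z0-9_/?\#:-])*#',
        '<a href="away?to=$0" target="_blank">$0</a>',
        $message
    );

    // Bare "www..." hosts; "+" requires at least one character after "www".
    return preg_replace(
        '#(^| |\n)(www([.]?[&;%=a-zA-Z0-9_/?-])+)#',
        '$1<a href="away?to=http://$2" target="_blank">$2</a>',
        $message
    );
}

echo linkify('www'), "\n";                    // a bare "www" is left untouched
echo linkify('visit www.example.com'), "\n";  // a www host is still linked
echo linkify('http://example.com/mediathek#/video/1976914/zoom:-World-Wide'), "\n";
```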

Upvotes: 1
