Reputation: 719

Link unlinked urls (BBCode) regex

I need a regex that looks for any URL that isn't already inside [url(=...)]...[/url] tags. In other words, I want to link any URL that isn't linked and replace the link with [url]link[/url] so that the parser I'm using can take care of it as it usually would.

I've been trying to get an understanding of negative lookaheads (which is apparently what I should make use of), but I just can't get it down.

This is what I've got so far:

preg_replace('/(?!\[url(=.*?)?\])(https?|ftps?|irc):\/\/(www\.)?(\w+(:\w+)?@)?[a-z0-9-]+(\.[a-z0-9-])*.*(?!\[\/url\])/i',"[url]$0[/url]",$Str);

Thanks

Upvotes: 1

Answers (3)

user966939

Reputation: 719

My solution:

<?php
$URLRegex = '/(?:(?<!(\[\/url\]|\[\/url=))(\s|^))';     // At start of string or after whitespace, where that spot is not directly preceded by "[/url]" or "[/url="
$URLRegex.= '(';                                        // Start capturing URL
$URLRegex.= '(https?|ftps?|ircs?):\/\/';                // Protocol
$URLRegex.= '\S+';                                      // Any non-space character
$URLRegex.= ')';                                        // Stop capturing URL
$URLRegex.= '(?:(?<![[:punct:]])|(?<=\/))(\s|\.?$)/i';  // URL must not end with punctuation (except /) and must be followed by whitespace or end of string (optionally after one trailing dot)

$Str = preg_replace($URLRegex,"$2[url]$3[/url]$5",$Str);
?>
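
For anyone wanting to try it, here is a quick usage sketch. It wraps the same pattern in a helper function; the name bbcodeLinkify and the sample strings are made up for illustration and are not part of the original snippet.

```php
<?php
// Hypothetical wrapper around the pattern above (function name is an assumption).
function bbcodeLinkify(string $str): string
{
    $re  = '/(?:(?<!(\[\/url\]|\[\/url=))(\s|^))';        // $1 (unused), $2: leading whitespace or start
    $re .= '((https?|ftps?|ircs?):\/\/\S+)';              // $3: URL, $4: protocol
    $re .= '(?:(?<![[:punct:]])|(?<=\/))(\s|\.?$)/i';     // $5: trailing whitespace or final dot
    return preg_replace($re, '$2[url]$3[/url]$5', $str);
}

// A bare URL surrounded by whitespace gets wrapped.
echo bbcodeLinkify('See http://example.com/page for details'), "\n";
// → See [url]http://example.com/page[/url] for details

// URLs already inside tags have no whitespace in front, so they are skipped.
echo bbcodeLinkify('[url]http://example.com[/url] stays as is'), "\n";
// → [url]http://example.com[/url] stays as is

// A single dot at the very end of the string is kept outside the tag.
echo bbcodeLinkify('Visit http://example.com.'), "\n";
// → Visit [url]http://example.com[/url].
```

Because the pattern consumes the whitespace after each URL, the second of two URLs separated by a single space is not linked, and a URL followed directly by a comma is left untouched.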

Upvotes: 3

ridgerunner

Reputation: 34395

Linkifying unlinked URLs is not trivial. There are a lot of gotchas (see "The Problem with URLs" and the thread of comments following that blog entry). The problem is compounded when some URLs are already linked and must be skipped over. I have looked into this problem and have been working on a solution, an open source project: LinkifyURL. Here is the most recent incarnation of a function which does what you are asking. Note that the regex is NOT trivial (but neither is the problem, as it turns out).

function linkify($text) {
    $url_pattern = '/# Rev:20100913_0900 github.com\/jmrware\/LinkifyURL
    # Match http & ftp URL that is not already linkified.
      # Alternative 1: URL delimited by (parentheses).
      (\()                     # $1: "(" start delimiter.
      ((?:ht|f)tps?:\/\/[a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]+)  # $2: URL.
      (\))                     # $3: ")" end delimiter.
    | # Alternative 2: URL delimited by [square brackets].
      (\[)                     # $4: "[" start delimiter.
      ((?:ht|f)tps?:\/\/[a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]+)  # $5: URL.
      (\])                     # $6: "]" end delimiter.
    | # Alternative 3: URL delimited by {curly braces}.
      (\{)                     # $7: "{" start delimiter.
      ((?:ht|f)tps?:\/\/[a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]+)  # $8: URL.
      (\})                     # $9: "}" end delimiter.
    | # Alternative 4: URL delimited by <angle brackets>.
      (<|&(?:lt|\#60|\#x3c);)  # $10: "<" start delimiter (or HTML entity).
      ((?:ht|f)tps?:\/\/[a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]+)  # $11: URL.
      (>|&(?:gt|\#62|\#x3e);)  # $12: ">" end delimiter (or HTML entity).
    | # Alternative 5: URL not delimited by (), [], {} or <>.
      (                        # $13: Prefix proving URL not already linked.
        (?: ^                  # Can be a beginning of line or string, or
        | [^=\s\'"\]]          # a non-"=", non-quote, non-"]", followed by
        ) \s*[\'"]?            # optional whitespace and optional quote;
      | [^=\s]\s+              # or... a non-equals sign followed by whitespace.
      )                        # End $13. Non-prelinkified-proof prefix.
      ( \b                     # $14: Other non-delimited URL.
        (?:ht|f)tps?:\/\/      # Required literal http, https, ftp or ftps prefix.
        [a-z0-9\-._~!$\'()*+,;=:\/?#[\]@%]+ # All URI chars except "&" (normal*).
        (?:                    # Either on a "&" or at the end of URI.
          (?!                  # Allow a "&" char only if not start of an...
            &(?:gt|\#0*62|\#x0*3e);                  # HTML ">" entity, or
          | &(?:amp|apos|quot|\#0*3[49]|\#x0*2[27]); # a [&\'"] entity if
            [.!&\',:?;]?        # followed by optional punctuation then
            (?:[^a-z0-9\-._~!$&\'()*+,;=:\/?#[\]@%]|$)  # a non-URI char or EOS.
          ) &                  # If neg-assertion true, match "&" (special).
          [a-z0-9\-._~!$\'()*+,;=:\/?#[\]@%]* # More non-& URI chars (normal*).
        )*                     # Unroll-the-loop (special normal*)*.
        [a-z0-9\-_~$()*+=\/#[\]@%]  # Last char can\'t be [.!&\',;:?]
      )                        # End $14. Other non-delimited URL.
    /imx';
    $url_replace = '$1$4$7$10$13<a href="$2$5$8$11$14">$2$5$8$11$14</a>$3$6$9$12';
    return preg_replace($url_pattern, $url_replace, $text);
}

This solution does have some limitations, and recently I have been working on an improved version (which is simpler and works better) - but it is not yet ready for prime-time.

Be sure to take a look at the linkify test page where I have put together a list of really-hard-to-match-in-the-wild URLs.
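
The function above emits HTML anchors, while the question asks for BBCode. As a much rougher sketch of the "skip what is already linked" idea, one option is to let a single pattern match either an existing [url] block or a bare URL, and decide in a callback. The helper name wrapBareUrls and the deliberately simplified URL character class are assumptions for illustration; this is not the LinkifyURL pattern and it ignores most of the gotchas mentioned above.

```php
<?php
// Hypothetical helper (name and simplified URL pattern are assumptions).
function wrapBareUrls(string $text): string
{
    $pattern = '~
        \[url(?:=[^\]]*)?\].*?\[/url\]            # an existing [url] or [url=...] block
      | (?:https?|ftps?|ircs?)://[^\s\[\]<>"]+    # otherwise, a bare URL
    ~isx';

    return preg_replace_callback($pattern, function (array $m): string {
        // Already-tagged blocks start with "[" and pass through untouched.
        if ($m[0][0] === '[') {
            return $m[0];
        }
        // Keep trailing sentence punctuation outside the tag.
        $url  = rtrim($m[0], '.,;:!?');
        $tail = substr($m[0], strlen($url));
        return '[url]' . $url . '[/url]' . $tail;
    }, $text);
}

echo wrapBareUrls('Go to http://example.com/a?b=1, then [url]http://x.org[/url] or [url=http://y.org]Y[/url].'), "\n";
// → Go to [url]http://example.com/a?b=1[/url], then [url]http://x.org[/url] or [url=http://y.org]Y[/url].
```

Since the regex engine consumes a whole tagged block before it ever looks at the URL inside it, no lookbehind is needed to prove that a URL is unlinked.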

Upvotes: 1

user149341

Reputation:

There's an excellent URL-matching regular expression here:

http://daringfireball.net/2009/11/liberal_regex_for_matching_urls

Upvotes: 1
