Tarquin

Reputation: 492

PHP - Parsing URLs in a message while ignoring all HTML tags

I am trying to process messages in a small, private ticketing system so that URLs are automatically turned into clickable links without messing up any HTML that may be posted. Up until now the function that parses URLs has worked well, but one or two users of the system want to be able to post embedded images rather than adding them as attachments.

This is the existing code that converts strings into clickable URLs. Please note that I have limited knowledge of regex and have relied on some assistance from others to build this:

    $text = preg_replace(
     array(
       '/(^|\s|>)(www.[^<> \n\r]+)/iex',
       '/(^|\s|>)([_A-Za-z0-9-]+(\\.[A-Za-z]{2,3})?\\.[A-Za-z]{2,4}\\/[^<> \n\r]+)/iex',
       '/(?(?=<a[^>]*>.+<\/a>)(?:<a[^>]*>.+<\/a>)|([^="\']?)((?:https?):\/\/([^<> \n\r]+)))/iex'
     ),  
     array(
       "stripslashes((strlen('\\2')>0?'\\1<a href=\"http://\\2\" target=\"_blank\">\\2</a>&nbsp;\\3':'\\0'))",
       "stripslashes((strlen('\\2')>0?'\\1<a href=\"http://\\2\" target=\"_blank\">\\2</a>&nbsp;\\4':'\\0'))",
       "stripslashes((strlen('\\2')>0?'\\1<a href=\"\\2\" target=\"_blank\">\\3</a>&nbsp;':'\\0'))",
     ), $text);

    return $text;

How would I go about modifying an existing function, such as the one above, so that it excludes matches inside HTML tags such as `<img>` without breaking the rest of its functionality?

Example:

`<img src="https://example.com/image.jpg">`

turns into

`<img src="<a href="https://example.com/image.jpg" target="_blank">example.com/image.jpg</a>">`

I have done some searching before posting, the most popular hits I am turning up are;

The common trend is "This is the wrong way to do it", which is obviously true. However, while I agree, I also want to keep the function quite light. The system is used privately within the organisation and we only wish to process img tags and URLs automatically using this. Everything else is left plain: no lists, code tags, quotes, etc.

I greatly appreciate your assistance here.

Summary: How do I modify an existing set of regular expression rules to exclude matches found within an img or other HTML tag inside a block of text?

Upvotes: 0

Views: 97

Answers (1)

mickmackusa

Reputation: 47992

Judging by your use of the /e modifier, your PHP version is probably 5.4 or lower (the modifier was deprecated in PHP 5.5 and removed in PHP 7.0). preg_replace_callback() has been available since PHP 4.0.5, so it should be usable on your version.
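As a rough sketch of what moving off the /e modifier can look like, here is the first rule from the question (the bare www case) rewritten with a callback. The function name `linkifyWww` is made up for illustration, the dot after `www` is escaped here (the original pattern left it unescaped), and the trailing `&nbsp;` of the original replacement is dropped for brevity.

```php
<?php
// Hypothetical helper: the question's first rule, rewritten without the /e modifier.
function linkifyWww($text)
{
    return preg_replace_callback(
        '/(^|\s|>)(www\.[^<> \n\r]+)/i',
        function ($m) {
            // $m[1] is the leading boundary, $m[2] is the bare www address.
            return $m[1] . '<a href="http://' . $m[2] . '" target="_blank">' . $m[2] . '</a>';
        },
        $text
    );
}

echo linkifyWww('Visit www.example.com today'), "\n";
```

The callback receives the match array directly, so there is no evaluated replacement string and no need for stripslashes().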

While I would not like to be roped into a big back-and-forth with a multitude of answer edits, I would like to give you some traction.

The method below is certainly not something I would stake my career on. As stated in the comments under the question and on many, many pages on SO, HTML should not be parsed with regex. (disclaimer complete)

PHP5.4.34 Demo Link & Regex Pattern Demo Link

    $text = 'This has an img tag <img src="https://example.com/image.jpg"> that should be ignored.
    This is an img that needs to become a tag: https://example.com/image.jpg.
    This is a <a href="https://www.example.com/image" target="_blank">tagged link</a> with target.
    This is a <a href="https://example.com/image?what=something&when=something">tagged link</a> without target.
    This is an untagged url http://example.com/image.jpg.
    (Please extend this battery of test cases to isolate any monkeywrenching cases)
    Another short url example.com/
    Another short url example.com/index.php?a=b&c=d
    Another www.example.com';

    // Left of the pipe: consume <a ...> and <img ...> opening tags, then discard them.
    // Right of the pipe: capture URL-like text; group 4 holds the final dot segment, e.g. ".jpg".
    $pattern = '~<(?:a|img)[^>]+?>(*SKIP)(*FAIL)|(((?:https?:)?(?:/{2})?)(w{3})?\S+(\.\S+)+\b(?:[?#&/]\S*)*)~';

    function taggify($m)
    {
        // $m[4] starts with the dot, so anchor on it; add more filetypes as needed
        if (preg_match('/^\.(?:bmp|gif|png|jpe?g)\b/i', $m[4])) {
            return "<img src=\"{$m[0]}\">";
        } else {
            //var_export(parse_url($m[0]));  // if you need to do preparations, consider using parse_url()
            return "<a href=\"{$m[0]}\" target=\"_blank\">{$m[0]}</a>";
        }
    }

    $text = preg_replace_callback($pattern, 'taggify', $text);
    echo $text;

Output:

    This has an img tag <img src="https://example.com/image.jpg"> that should be ignored.
    This is an img that needs to become a tag: <img src="https://example.com/image.jpg">.
    This is a <a href="https://www.example.com/image" target="_blank">tagged link</a> with target.
    This is a <a href="https://example.com/image?what=something&when=something">tagged link</a> without target.
    This is an untagged url <img src="http://example.com/image.jpg">.
    (Please extend this battery of test cases to isolate any monkeywrenching cases)
    Another short url <a href="example.com/" target="_blank">example.com/</a>
    Another short url <a href="example.com/index.php?a=b&c=d" target="_blank">example.com/index.php?a=b&c=d</a>
    Another <a href="www.example.com" target="_blank">www.example.com</a>
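Note that the last three links in the output above have href values without a scheme, which a browser will resolve relative to the current page. One possible preparation step, sketched here with a hypothetical helper named `ensureScheme` (not part of the demo above), is to prefix such matches before building the tag. It uses a plain prefix check rather than parse_url(), and the choice of `http` as the default scheme is an assumption.

```php
<?php
// Hypothetical helper: make a matched URL absolute so it is usable as an href.
function ensureScheme($url)
{
    // A URL that already starts with http:// or https:// is left untouched.
    if (preg_match('~^https?://~i', $url)) {
        return $url;
    }
    // Protocol-relative URLs (//example.com/...) only need the scheme name.
    if (strpos($url, '//') === 0) {
        return 'http:' . $url;
    }
    // Everything else (example.com/, www.example.com) gets a full prefix.
    return 'http://' . $url;
}

echo ensureScheme('example.com/index.php?a=b&c=d'), "\n";
```

Inside taggify() this could be called as `ensureScheme($m[0])` for the href or src value while still displaying `$m[0]` as the link text.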

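The SKIP-FAIL technique works to "disqualify" unwanted matches: whatever the left side of the pipe (|) consumes is discarded, and the regex engine resumes searching after it. The qualifying matches are expressed by the section of the pattern that follows the pipe after (*SKIP)(*FAIL).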
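To see the technique in isolation, here is a minimal sketch on a much simpler problem: replace a word everywhere except inside HTML tags. The sample string and the uppercase replacement are made up for the illustration.

```php
<?php
// Left of the pipe: consume any complete tag, then throw that match away.
// Right of the pipe: the text we actually want to match.
$pattern = '~<[^>]+>(*SKIP)(*FAIL)|\bexample\b~';
$input   = 'example <img alt="example"> example';

// Only the two occurrences outside the tag are replaced.
$result = preg_replace($pattern, 'EXAMPLE', $input);
echo $result, "\n";
```

The occurrence inside the alt attribute is never offered to the right-hand alternative, because the whole tag has already been consumed and skipped by the left-hand one.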

Upvotes: 1
