Maartyn

Reputation: 11

How do I replace multiple instances of the less-than sign (<) in a PHP string that is also passed through strip_tags?

I have the following string stored in a database table that contains HTML I need to strip out before rendering on a web page (This is old content I had no control over).

<p>I am <30 years old and weight <12st</p>

When I use strip_tags, it only shows "I am".

I understand why strip_tags does that, so I need to replace the two instances of < with &lt;

I have found a regex that converts the first instance but not the 2nd, but I can't work out how to amend this to replace all instances.

/<([^>]*)(<|$)/

which results in I am currently &lt;30 years old and less than

I have a demo here https://eval.in/1117956

Upvotes: 1

Views: 179

Answers (4)

Casimir et Hippolyte

Reputation: 89557

It's a bad idea to try to parse HTML content with string functions, including regex functions (there are many topics on SO that explain why; search for them). HTML is too complicated for that.

The problem is that you have poorly formatted HTML over which you have no control. There are two possible attitudes:

  • There's nothing to do: the data are corrupted, so the information is lost once and for all, and you can't retrieve something that has disappeared. This is a perfectly acceptable point of view. Maybe you can find another source for the same data somewhere, or you can choose to print the poorly formatted HTML as is.
  • You can try to repair it. In this case you have to ensure that the problems in the document are limited in number and can be solved (at least by hand).

Instead of a direct string approach, you can use the PHP libxml implementation via DOMDocument. Although the libxml parser will not give better results than strip_tags, it reports errors that you can use to identify what went wrong and to locate the problematic positions in the HTML string.

With your string, the libxml parser returns a recoverable error XML_ERR_NAME_REQUIRED with the code 68 on each problematic opening angle bracket. Errors can be seen using libxml_get_errors().

Example with your string:

$s = '<p>I am <30 years old and weight <12st</p>';

$libxmlErrorState = libxml_use_internal_errors(true);

function getLastErrorPos($code) {
    $errors = array_filter(libxml_get_errors(), function ($e) use ($code) {
        return $e->code === $code;
    });

    if ( !$errors )
        return false;

    $lastError = array_pop($errors);
    return ['line' => $lastError->line - 1, 'column' => $lastError->column - 2 ];
}

define('XML_ERR_NAME_REQUIRED', 68); // xmlParseEntityRef: no name

$patternTemplate = '~(?:.*\R){%d}.{%d}\K<~A';

$dom = new DOMDocument;
$dom->loadHTML($s, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);

while ( false !== $position = getLastErrorPos(XML_ERR_NAME_REQUIRED) ) {
    libxml_clear_errors();
    $pattern = vsprintf($patternTemplate, $position);

    $s = preg_replace($pattern, '&lt;', $s, 1);
    $dom = new DOMDocument;
    $dom->loadHTML($s, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
}

echo $dom->saveHTML();

libxml_clear_errors();
libxml_use_internal_errors($libxmlErrorState);

demo

$patternTemplate is a format string (see sprintf in the PHP manual) in which the two %d placeholders stand for, respectively, the number of lines before the error and the position from the start of that line (0 and 8 here).

Pattern details: The goal of the pattern is to reach the angle bracket position from the start of the string.

~ # my favorite pattern delimiter
  (?:
      .* # all characters up to the end of the line
      \R # the newline sequence
  ){0} # reach the desired line

  .{8} # reach the desired column
  \K   # remove all on the left from the match result
  <    # the match result is only this character
~A # anchor the pattern at the start of the string
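
To make the pattern concrete, here is a minimal sketch that fills the template by hand and applies it to the sample string. The positions in $positions are hypothetical values counted by hand for this illustration (in the code above they come from the libxml errors), and the last bracket is handled first, as the loop above does.

```php
<?php
// Template from the answer: the two %d placeholders are
// (number of lines to skip, column of the bracket on that line).
$patternTemplate = '~(?:.*\R){%d}.{%d}\K<~A';
$s = '<p>I am <30 years old and weight <12st</p>';

// Hypothetical positions, counted by hand for this example only.
$positions = [
    ['line' => 0, 'column' => 33], // the "<" before "12st"
    ['line' => 0, 'column' => 8],  // the "<" before "30"
];

foreach ($positions as $position) {
    // Build e.g. '~(?:.*\R){0}.{33}\K<~A' and replace that single bracket.
    $pattern = sprintf($patternTemplate, $position['line'], $position['column']);
    $s = preg_replace($pattern, '&lt;', $s, 1);
}

echo $s, PHP_EOL;
```

Working from the last position backwards means each replacement only lengthens the part of the string after the positions still to be processed, so the earlier column stays valid.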

Another related question in which I used a similar technique: parse invalid XML manually

Upvotes: 1

Emma

Reputation: 27723

My guess is that we want a good right boundary that identifies a < that is not part of a tag, maybe a simple expression similar to:

<(\s*[+-]?[0-9])

This might work, since we would normally expect a number or a sign right after such a <. The [+-]?[0-9] part would have to change if other kinds of text could follow the <.

Demo

Test

$re = '/<(\s*[+-]?[0-9])/m';
$str = '<p>I am <30 years old and weight <12st I am <  30 years old and weight <  12st I am <30 years old and weight <  -12st I am <  +30 years old and weight <  12st</p>';
$subst = '&lt;$1';

$result = preg_replace($re, $subst, $str);

echo $result;
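
To tie this back to the question, the substitution can be followed by strip_tags. This is only a sketch on the question's sample string, assuming every stray < is followed by optional whitespace, an optional sign, and a digit; the variable names are made up for the example.

```php
<?php
$str = '<p>I am <30 years old and weight <12st</p>';

// Escape every "<" that is followed by optional whitespace,
// an optional sign, and a digit.
$escaped = preg_replace('/<(\s*[+-]?[0-9])/', '&lt;$1', $str);

// Real tags are still intact here, so strip_tags removes only those
// and leaves the &lt; entities untouched.
$text = strip_tags($escaped);

echo $text, PHP_EOL;
```

The entities are not decoded by strip_tags, so the text can be printed into the page as it is.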

Upvotes: 0

RiggsFolly

Reputation: 94652

A simple use of str_replace() would do it.

  1. Replace <p> and </p> with the placeholders [p] and [/p].
  2. Replace every remaining < with &lt;
  3. Put the p tags back, i.e. replace [p] and [/p] with <p> and </p>.

Code

<?php
$description = "<p>I am <30 years old and weight <12st</p>";

$d = str_replace(['[p]','[/p]'],['<p>','</p>'], 
            str_replace('<', '&lt;', 
                str_replace(['<p>','</p>'], ['[p]','[/p]'], 
                    $description)));

echo $d;

RESULT

<p>I am &lt;30 years old and weight &lt;12st</p>
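
If the content contains more tags than just <p>, the same idea (protect the known tags, escape every other <) could be sketched with a single regex instead of placeholders. This is a different technique from the str_replace() chain above, and the $allowedTags list is a hypothetical allowlist that would need to match the tags actually present in the data.

```php
<?php
$description = '<p>I am <30 years old and weight <12st</p>';

// Hypothetical allowlist of tag names expected in the stored content.
$allowedTags = ['p', 'br', 'strong', 'em'];
$names = implode('|', array_map('preg_quote', $allowedTags));

// Match a "<" only when it does NOT start an opening or closing tag
// whose name is in the allowlist.
$pattern = '~<(?!/?(?:' . $names . ')\b[^<>]*>)~i';

$d = preg_replace($pattern, '&lt;', $description);

echo $d, PHP_EOL;
```

Text such as "a <p and b>" would still be mistaken for a tag, so this only helps when stray brackets are not followed by an allowed tag name.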

Upvotes: 0

Rakesh Jakhar

Reputation: 6388

Try this:

$string = '<p>I am <30 years old and weight <12st</p>';
$html = preg_replace('/^\s*<[^>]+>\s*|\s*<\/[^>]+>\s*\z/', '', $string); // remove the leading opening tag and the trailing closing tag
$final = preg_replace('/[^A-Za-z0-9 !@#$%^&*().]/u', '', $html); // remove the remaining special characters (this also drops the < signs)

Live DEMO

Upvotes: 0
