Reputation: 11
I have the following string stored in a database table that contains HTML I need to strip out before rendering on a web page (This is old content I had no control over).
<p>I am <30 years old and weight <12st</p>
When I have used strip_tags
it is only showing I am
.
I understand why the strip_tags is doing that so I need to replace the 2 instances of the <
with <
I have found a regex that converts the first instance but not the 2nd, but I can't work out how to amend this to replace all instances.
/<([^>]*)(<|$)/
which results in I am currently <30 years old and less than
I have a demo here https://eval.in/1117956
Upvotes: 1
Views: 179
Reputation: 89557
It's a bad idea to try to parse html content with string functions, including regex functions (there're many topics that explain that on SO, search them). html is too complicated to do that.
The problem is that you have poorly formatted html on which you have no control. There're two possible attitudes:
In place of a direct string approach, you can use the PHP libxml implementation via DOMDocument
. Even if the libxml parser will not give better results than strip_tags
, it provides errors you can use to identify the kind of error and to find the problematic positions in the html string.
With your string, the libxml parser returns a recoverable error XML_ERR_NAME_REQUIRED
with the code 68 on each problematic opening angle bracket. Errors can be seen using libxml_get_errors()
.
Example with your string:
$s = '<p>I am <30 years old and weight <12st</p>';
$libxmlErrorState = libxml_use_internal_errors(true);
function getLastErrorPos($code) {
$errors = array_filter(libxml_get_errors(), function ($e) use ($code) {
return $e->code === $code;
});
if ( !$errors )
return false;
$lastError = array_pop($errors);
return ['line' => $lastError->line - 1, 'column' => $lastError->column - 2 ];
}
define('XML_ERR_NAME_REQUIRED', 68); // xmlParseEntityRef: no name
$patternTemplate = '~(?:.*\R){%d}.{%d}\K<~A';
$dom = new DOMDocument;
$dom->loadHTML($s, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
while ( false !== $position = getLastErrorPos(XML_ERR_NAME_REQUIRED) ) {
libxml_clear_errors();
$pattern = vsprintf($patternTemplate, $position);
$s = preg_replace($pattern, '<', $s, 1);
$dom = new DOMDocument;
$dom->loadHTML($s, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
}
echo $dom->saveHTML();
libxml_clear_errors();
libxml_use_internal_errors($libxmlErrorState);
$patternTemplate
is a formatted string (see sprintf
in the php manual) in which the placeholders %d
stand for respectively the number of lines before and the position from the start of the line. (0 and 8 here)
Pattern details: The goal of the pattern is to reach the angle bracket position from the start of the string.
~ # my favorite pattern delimiter
(?:
.* # all character until the end of the line
\R # the newline sequence
){0} # reach the desired line
.{8} # reach the desired column
\K # remove all on the left from the match result
< # the match result is only this character
~A # anchor the pattern at the start of the string
An other related question in which I used a similar technique: parse invalid XML manually
Upvotes: 1
Reputation: 27723
My guess is that here we might want to design a good right boundary to capture <
in non-tags, maybe a simple expression similar to:
<(\s*[+-]?[0-9])
might work, since we should normally have numbers or signs right after <
. [+-]?[0-9]
would likely change, if we would have other instances after <
.
$re = '/<(\s*[+-]?[0-9])/m';
$str = '<p>I am <30 years old and weight <12st I am < 30 years old and weight < 12st I am <30 years old and weight < -12st I am < +30 years old and weight < 12st</p>';
$subst = '<$1';
$result = preg_replace($re, $subst, $str);
echo $result;
Upvotes: 0
Reputation: 94652
A simple use of str_replace()
would do it.
<p> and </p>
with [p] and [/p]
<
with <
[p] and [/p]
with <p> and </p>
Code
<?php
$description = "<p>I am <30 years old and weight <12st</p>";
$d = str_replace(['[p]','[/p]'],['<p>','</p>'],
str_replace('<', '<',
str_replace(['<p>','</p>'], ['[p]','[/p]'],
$description)));
echo $d;
RESULT
<p>I am <30 years old and weight <12st</p>
Upvotes: 0
Reputation: 6388
try this
$string = '<p>I am <30 years old and weight <12st</p>';
$html = preg_replace('/^\s*<[^>]+>\s*|\s*<\/[^>]+>\s*\z/', '', $string);// remove html tags
$final = preg_replace('/[^A-Za-z0-9 !@#$%^&*().]/u', '', $html); //remove special character
Upvotes: 0