Reputation: 1082
This is my PHP function to remove all empty HTML tags from a string:
/**
* Remove the nested HTML empty tags from the string.
*
* @param $string String to remove tags
* @param null $replaceTo Replace empty string with
* @return mixed Cleaned string
*/
function crl_remove_empty_tags($string, $replaceTo = null)
{
// Return if string not given or empty
if (!is_string($string) || trim($string) == '') return $string;
// Recursive empty HTML tags
return preg_replace(
'/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>/gixsm',
!is_string($replaceTo) ? '' : $replaceTo,
$string
);
}
My regex: /<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>/gixsm
I tested it with http://gskinner.com/RegExr/ and http://regexpal.com/, and it worked well. But when I run it in PHP, the server always returns this error:
Warning: preg_replace(): Unknown modifier '\'
I have no idea what exactly goes wrong with '\'. Can someone please help me out?
Upvotes: 5
Views: 12358
Reputation: 1939
$string = '<p>Some <b>HTML</b> <strong>text. </strong> <hr></p>';
$clean_string = preg_replace('#<[^>]+>#', '', $string);
echo $clean_string; // Some HTML text.
Upvotes: 0
Reputation: 1351
This removes empty elements, and then any elements that became empty as a result.
For example:
<p>Hello!
<div class="foo"><p id="nobody">
</p>
</div>
</p>
Results:
<p>Hello!</p>
PHP code:
/* $html store the html content */
do {
$tmp = $html;
$html = preg_replace( '#<([^ >]+)[^>]*>([[:space:]]| )*</\1>#', '', $html );
} while ( $html !== $tmp );
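The same loop can also be driven by the replacement count that preg_replace() reports through its fifth argument, instead of comparing strings. This is only a sketch: the function name strip_empty_elements() is made up for illustration, and the pattern keeps just the [[:space:]] alternative from the one above.

```php
<?php
// Hypothetical helper; the name is not from the answer above.
function strip_empty_elements(string $html): string
{
    // Opening tag, optional whitespace, matching closing tag.
    $pattern = '#<([^ >]+)[^>]*>[[:space:]]*</\1>#';
    do {
        // $count receives the number of replacements made in this pass.
        $html = preg_replace($pattern, '', $html, -1, $count);
    } while ($count > 0);
    return $html;
}

$input = "<p>Hello!\n<div class=\"foo\"><p id=\"nobody\">\n</p>\n</div>\n</p>";
echo strip_empty_elements($input), "\n";
```

Each pass removes the innermost empty elements, so the loop runs once per nesting level plus one final pass that finds nothing. Whitespace that surrounded the removed elements stays in the string.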
Upvotes: 3
Reputation: 89557
This pattern removes "empty tags" (i.e. non-self-closing tags that contain nothing, whitespace, HTML comments, or other "empty tags"), even if these tags are nested, like <span><span></span></span>. Tags inside HTML comments are not taken into account:
$pattern = <<<'EOD'
~
<
(?:
!--[^-]*(?:-(?!->)[^-]*)*-->[^<]*(*SKIP)(*F) # skip comments
|
( # group 1
(\w++) # tag name in group 2
[^"'>]* #'"# all that is not a quote or a closing angle bracket
(?: # quoted attributes
"[^\\"]*(?:\\.[^\\"]*)*+" [^"'>]* #'"# double quote
|
'[^\\']*(?:\\.[^\\']*)*+' [^"'>]* #'"# single quote
)*+
>
\s*
(?:
<!--[^-]*(?:-(?!->)[^-]*)*+--> \s* # html comments
|
<(?1) \s* # recursion with the group 1
)*+
</\2> # closing tag
) # end of the group 1
)
~sxi
EOD;
$html = preg_replace($pattern, '', $html);
Limitations:
<script src="myscript.js"></script>
var myvar="<span></span>";
var myvar1="<span><!--";
function doSomething() { alert("!!!"); }
var myvar2="--></span>";
These limitations arise because a purely text-based approach cannot distinguish HTML from JavaScript code. You can work around this by adding "script" tags to the pattern's skip list (in the same way as HTML comments), but then you need to describe the JavaScript content itself (strings, comments, regex literals, and everything that is none of those three), which is possible but not trivial.
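A much cruder variant of that idea is to skip each script element as a whole with (*SKIP)(*F), without describing its JavaScript content. The sketch below assumes that the first </script> after an opening tag really ends the script. It is also a single, non-recursive pass with a simplified tag matcher, not the full pattern above, and the function name is hypothetical.

```php
<?php
// Hypothetical helper: one pass that leaves <script> elements untouched.
function remove_empty_tags_outside_scripts(string $html): string
{
    // First alternative: match a whole script element, then force a failure
    // and skip past it. Second alternative: a tag pair holding only whitespace.
    $pattern = '~<script\b[^>]*>.*?</script\s*>(*SKIP)(*F)|<(\w+)\b[^>]*>\s*</\1\s*>~si';
    // preg_replace() returns null on a PCRE error; fall back to the input.
    return preg_replace($pattern, '', $html) ?? $html;
}

$html = '<script>var s="<span></span>";</script><p></p><script src="a.js"></script>';
echo remove_empty_tags_outside_scripts($html), "\n";
// prints <script>var s="<span></span>";</script><script src="a.js"></script>
```

Here the empty <p></p> is removed, while both script elements, including the string that merely looks like an empty span, are left as they are.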
Upvotes: 5
Reputation: 1582
Here is another way of removing all empty tags. (It also removes surrounding tags if they are considered empty because all of their children are empty.)
/**
* Remove empty tags.
* This one will also remove <p><a href="/foo/bar.baz"><span></span></a></p> (empty paragraph with empty link)
* But it will not alter <p><a href="/foo/bar.baz"><span>[CONTENT HERE]</span></a></p> (since the span has content)
*
* Be aware: <img ../> will be treated as an empty tag!
*/
do
{
$len1 = mb_strlen($string);
$string = preg_replace('/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*\/?>\s*<\/\1\s*>/', '', $string);
$len2 = mb_strlen($string);
} while ($len1 > 0 && $len2 > 0 && $len1 != $len2);
I have been using this to sanitize html from an external CMS with positive results.
Upvotes: 0
Reputation: 1077
You can also solve this with recursion: keep passing the HTML blob back into the function until no empty tags remain.
public static function removeHTMLTagsWithNoContent($htmlBlob) {
$pattern = "/<[^\/>][^>]*><\/[^>]+>/";
if (preg_match($pattern, $htmlBlob) == 1) {
$htmlBlob = preg_replace($pattern, '', $htmlBlob);
return self::removeHTMLTagsWithNoContent($htmlBlob);
} else {
return $htmlBlob;
}
}
This will check for the presence of empty HTML tags and replace them until the regex pattern doesn't match anymore.
Upvotes: 0
Reputation: 345
I'm not quite sure if this is what you need, but I found it today. You need PHP 5.4+!
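$oDOMHTML = new DOMDocument();
// loadHTML() is an instance method; calling it statically throws an Error on PHP 8.
$oDOMHTML->loadHTML(
$sYourHTMLString,
LIBXML_HTML_NOIMPLIED |
LIBXML_HTML_NODEFDTD |
LIBXML_NOBLANKS |
LIBXML_NOEMPTYTAG
);
$sYourHTMLStringWithoutEmptyTags = $oDOMHTML->saveXML();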
Maybe this works for you.
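As far as I know, libxml flags only influence parsing and serialization, so deleting empty elements still means walking the DOM. The following is a minimal sketch of that, assuming the DOM extension is available. The helper name, the XPath query, and the short list of void elements to keep are my own choices for illustration.

```php
<?php
// Hypothetical helper, not part of the DOM extension itself.
function remove_empty_elements_dom(string $html): string
{
    $doc = new DOMDocument();
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Elements with no child elements and no non-whitespace text,
    // excluding a few void elements that are legitimately empty.
    $query = '//*[not(*) and not(normalize-space())'
           . ' and not(self::br or self::hr or self::img or self::input)]';
    do {
        // Copy the node list first so removals do not disturb iteration.
        $nodes = iterator_to_array($xpath->query($query));
        foreach ($nodes as $node) {
            $node->parentNode->removeChild($node);
        }
    } while (count($nodes) > 0);

    return $doc->saveHTML();
}

echo remove_empty_elements_dom('<div><p>Hello</p><span><b></b></span></div>');
```

The loop repeats because removing an inner element (the b here) can leave its parent (the span) empty on the next pass.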
Upvotes: 0
Reputation: 72905
In PHP regular expressions, you need to escape the delimiter wherever it occurs literally within the expression.
In your case, you have two unescaped / characters; simply replace them with \/. You also don't need the list of modifiers: preg_replace() replaces every match by default (PHP has no g modifier), and your pattern contains no literal letters, so i has no effect.
Before:
/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>/gixsm
After:
/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*\/?>\s*<\/\1\s*>/
// ^ ^
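Another way to avoid the escaping is to choose a delimiter that does not occur in the pattern. The sketch below uses ~ for that; the function name is hypothetical, and I dropped the duplicated "[^"]*" alternative since it matches the same thing twice.

```php
<?php
// Hypothetical helper: same pattern as the "After" version, but delimited
// with '~' so that the literal '/' characters need no escaping.
function remove_empty_tags_once(string $html, string $replaceTo = ''): string
{
    $pattern = '~<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>~';
    // preg_replace() returns null on a PCRE error; fall back to the input.
    return preg_replace($pattern, $replaceTo, $html) ?? $html;
}

echo remove_empty_tags_once('<p>Hi</p><span class="x"> </span>'), "\n";
// prints <p>Hi</p>
```

This is a single pass, so tags that only become empty after an inner tag is removed need a second call.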
Upvotes: 10