Reputation: 1082

Regex to remove all empty HTML tags

This is my PHP function to remove all empty HTML tags from a string:

/**
 * Remove the nested HTML empty tags from the string.
 *
 * @param $string String to remove tags
 * @param null $replaceTo Replace empty string with
 * @return mixed Cleaned string
 */
function crl_remove_empty_tags($string, $replaceTo = null)
{
    // Return if string not given or empty
    if (!is_string($string) || trim($string) == '') return $string;

    // Recursive empty HTML tags
    return preg_replace(
        '/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>/gixsm',
        !is_string($replaceTo) ? '' : $replaceTo,
        $string
    );
}

My regex: /<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>/gixsm

I tested it with http://gskinner.com/RegExr/ and http://regexpal.com/ and it worked well. But when I run it in PHP, the server always returns this error:

Warning: preg_replace(): Unknown modifier '\'

I have no idea what exactly is wrong with '\'. Could someone please help me out?

Upvotes: 5

Answers (7)

Ali Hesari

Reputation: 1939

$string = '<p>Some <b>HTML</b> <strong>text. </strong> <hr></p>';
$clean_string = preg_replace('#<[^>]+>#', '', $string);
echo $clean_string; // Some HTML text. (every tag is stripped, not only the empty ones)

Upvotes: 0

Alejandro Salamanca Mazuelo

Reputation: 1351

Remove the empty elements, then any elements left empty by that removal.

For example:

<p>Hello!
   <div class="foo"><p id="nobody">
   </p>
      </div>
 </p>

Results:

<p>Hello!</p>

PHP code:

/* $html store the html content */
do {
    $tmp = $html;
    $html = preg_replace( '#<([^ >]+)[^>]*>([[:space:]]|&nbsp;)*</\1>#', '', $html );
} while ( $html !== $tmp );
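
The same loop can be wrapped in a function. This is only a sketch: the function name remove_empty_elements() is made up for illustration, the pattern is the one from this answer, and the null check is an extra I added because preg_replace() returns null when PCRE hits an error.

```php
<?php
// Hypothetical helper name; the pattern is copied from the answer above.
function remove_empty_elements($html)
{
    $pattern = '#<([^ >]+)[^>]*>([[:space:]]|&nbsp;)*</\1>#';

    do {
        $before = $html;
        $result = preg_replace($pattern, '', $before);
        if ($result === null) {
            // preg_replace() returns null on a PCRE error; keep the last good string
            return $before;
        }
        $html = $result;
    } while ($html !== $before); // stop once a pass changes nothing

    return $html;
}

$html = "<p>Hello!\n   <div class=\"foo\"><p id=\"nobody\">\n   </p>\n      </div>\n </p>";
echo remove_empty_elements($html), "\n";
```

Note that only the elements are removed. The whitespace that surrounded them stays inside the outer paragraph, so trim or collapse it separately if that matters.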

Upvotes: 3

Casimir et Hippolyte

Reputation: 89557

This pattern removes "empty tags" (i.e. non-self-closing tags that contain nothing but whitespace, HTML comments, or other "empty tags"), even when they are nested, like <span><span></span></span>. Tags inside HTML comments are not taken into account:

$pattern = <<<'EOD'
~
<
(?:
    !--[^-]*(?:-(?!->)[^-]*)*-->[^<]*(*SKIP)(*F) # skip comments
  |
    ( # group 1
        (\w++)     # tag name in group 2
        [^"'>]* #'"# all that is not a quote or a closing angle bracket
        (?: # quoted attributes
            "[^\\"]*(?:\\.[^\\"]*)*+" [^"'>]* #'"# double quote
          |
            '[^\\']*(?:\\.[^\\']*)*+' [^"'>]* #'"# single quote
        )*+
        >
        \s*
        (?:
            <!--[^-]*(?:-(?!->)[^-]*)*+--> \s* # html comments
          |
            <(?1) \s*                          # recursion with the group 1
        )*+
        </\2> # closing tag
    ) # end of the group 1
)
~sxi
EOD;

$html = preg_replace($pattern, '', $html);

Limitations:

This approach will remove links to external Javascript files:
<script src="myscript.js"></script>
The pattern may remove part of embedded Javascript code if something like:
var myvar="<span></span>";
or like:
var myvar1="<span></span>";
is found.

These limitations exist because a plain-text approach cannot tell HTML from JavaScript code. It is possible to solve this by adding "script" tags to the pattern's skip list (in the same way as HTML comments), but then you basically need to describe the JavaScript content (strings, comments, literal patterns, and everything that is none of those three), which is possible but not trivial.
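
To give an idea of what a skip list with "script" in it could look like, here is a much simplified sketch. It is not the recursive pattern above: it handles one level of empty tags per pass, uses a naive [^>]* for attributes, and assumes a script element ends at the first </script>, without describing the JavaScript content at all. The function name is hypothetical.

```php
<?php
// Hypothetical helper; a simplified pattern, not the recursive one from the answer.
function remove_empty_tags_outside_scripts($html)
{
    $pattern = '~
        <script\b[^>]*> .*? </script\s*> (*SKIP)(*F)   # leave script elements alone
      |
        <!-- .*? --> (*SKIP)(*F)                        # leave HTML comments alone
      |
        <(\w+)\b[^>]*> \s* </\1\s*>                     # an empty element
    ~xis';

    $result = preg_replace($pattern, '', $html);

    // preg_replace() returns null on a PCRE error; fall back to the input
    return $result === null ? $html : $result;
}

$html = '<script src="a.js"></script><p> </p><script>var s="<span></span>";</script>';
echo remove_empty_tags_outside_scripts($html), "\n";
```

With (*SKIP)(*F) the script and comment branches match and then fail on purpose, so the regex engine jumps past them and the third branch never sees their content.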

Upvotes: 5

qualbeen

Reputation: 1582

Here is another way of removing all empty tags. (It also removes surrounding tags if they are considered empty because their children are empty.)

/**
 * Remove empty tags.
 * This one will also remove <p><a href="/foo/bar.baz"><span></span></a></p> (empty paragraph with empty link)
 * But it will not alter <p><a href="/foo/bar.baz"><span>[CONTENT HERE]</span></a></p> (since the span has content)
 *
 * Be aware: <img ../> will be treated as an empty tag!
 */
do
{
    $len1 = mb_strlen($string);
    $string = preg_replace('/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*\/?>\s*<\/\1\s*>/', '', $string);
    $len2 = mb_strlen($string);

} while ($len1 > 0 && $len2 > 0 && $len1 != $len2);

I have been using this to sanitize HTML from an external CMS with positive results.

Upvotes: 0

TALLBOY

Reputation: 1077

You can also use recursion to solve for this. Continue to pass the HTML blob back into the function until the empty tags are no longer present.

public static function removeHTMLTagsWithNoContent($htmlBlob) {
    $pattern = "/<[^\/>][^>]*><\/[^>]+>/";

    if (preg_match($pattern, $htmlBlob) == 1) {
        $htmlBlob = preg_replace($pattern, '', $htmlBlob);
        return self::removeHTMLTagsWithNoContent($htmlBlob);
    } else {
        return $htmlBlob;
    }
}

This will check for the presence of empty HTML tags and replace them until the regex pattern doesn't match anymore.
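
If you would rather avoid the recursion and the separate preg_match() call, the fifth argument of preg_replace() reports how many replacements were made, so a plain loop can do the same job. This is a sketch with the same pattern; the function name is made up for illustration.

```php
<?php
// Hypothetical iterative variant; same pattern as in the answer above.
function remove_tags_with_no_content($htmlBlob)
{
    $pattern = '/<[^\/>][^>]*><\/[^>]+>/';

    do {
        // $count receives the number of replacements made in this pass
        $result = preg_replace($pattern, '', $htmlBlob, -1, $count);
        if ($result === null) {
            // PCRE error: return the last good string
            return $htmlBlob;
        }
        $htmlBlob = $result;
    } while ($count > 0);

    return $htmlBlob;
}

echo remove_tags_with_no_content('<div><span></span></div>text'), "\n"; // text
```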

Upvotes: 0

boesing

Reputation: 345

Not quite sure if this is what you need, but I found it today. You need PHP 5.4+!

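// loadHTML() is an instance method; calling it statically throws an Error on PHP 8
$oDOMHTML = new DOMDocument();
$oDOMHTML->loadHTML(
    $sYourHTMLString,
    LIBXML_HTML_NOIMPLIED |
    LIBXML_HTML_NODEFDTD |
    LIBXML_NOBLANKS |
    LIBXML_NOEMPTYTAG
);
$sYourHTMLStringWithoutEmptyTags = $oDOMHTML->saveXML();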

Maybe this works for you.
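
As far as I can tell from the PHP manual, LIBXML_NOEMPTYTAG only affects how empty elements are written out (it expands <br/> to <br></br>) and does not delete anything, so the empty elements probably have to be removed explicitly. A rough sketch with DOMXPath follows. The function name, the wrapper <div>, and the list of elements to keep are my assumptions, and loadHTML() may mangle non-ASCII input unless the encoding is declared.

```php
<?php
// Hypothetical helper; wrapper element and keep-list are assumptions.
function remove_empty_elements_dom($html)
{
    // Void elements are empty by definition and should stay
    $keep = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'];

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence parser warnings
    // A single wrapper element gives the fragment exactly one root node
    $doc->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root  = $doc->documentElement;
    $xpath = new DOMXPath($doc);

    do {
        $removed = 0;
        // Elements with no child elements and no non-whitespace text
        $nodes = $xpath->query('.//*[not(*) and normalize-space() = ""]', $root);
        foreach ($nodes as $node) {
            if (!in_array(strtolower($node->nodeName), $keep, true)) {
                $node->parentNode->removeChild($node);
                $removed++;
            }
        }
    } while ($removed > 0); // repeat: parents may have become empty

    // Serialize the wrapper's children, not the wrapper itself
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo remove_empty_elements_dom('<p>Hello <b>world</b></p><div><span> </span></div>'), "\n";
```

Like the regex answers, this would also drop <script src="..."></script>, so add such element names to the keep-list if you need them.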

Upvotes: 0

brandonscript

Reputation: 72905

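In PHP regular expressions, you need to escape the delimiter wherever it occurs literally within the expression.

In your case, you have two unescaped /; simply replace them with \/. PHP treats the first unescaped / as the end of the pattern and tries to read whatever follows as modifiers, which is what triggers the "Unknown modifier" warning. You also don't need the modifiers: preg_replace() is global by default (g is not a valid modifier in PHP), and i is unnecessary because the pattern contains no literal letters.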

Before:

/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*/?>\s*</\1\s*>/gixsm

After:

/<(\w+)\b(?:\s+[\w\-.:]+(?:\s*=\s*(?:"[^"]*"|"[^"]*"|[\w\-.:]+))?)*\s*\/?>\s*<\/\1\s*>/
//                                                                    ^       ^

Upvotes: 10
