Reputation: 31

Removing specific html tags using preg_replace without removing content

I'm trying to clean up excess HTML based on CSS classes. I don't want to remove all tags of a certain type, just specific tags, and I want to keep the content within them intact. I'm trying variations along the lines of this:

$content = preg_replace(
    '#(<div class\=\"removethis\">(^.*)</div>)#is', 
    '', 
    $content
);

I realise the above code can't work, but hopefully it'll help explain what I'm trying to do. I'm not that familiar with regular expressions, so I haven't found anything that works so far.

Upvotes: 3

Answers (3)

ridgerunner

Reputation: 34435

Disclaimer: Don't use regex!

It is not recommended to use regular expressions to parse HTML (or any other non-regular language). There are many pitfalls and ways for the solution to fail. That said, I do thoroughly enjoy using regular expressions to solve complex problems such as this one which involves nested structures. If someone else provides a working non-regex solution, I would recommend that you use that one instead of the following.

A regex solution:

The following solution uses a recursive regular expression together with preg_replace_callback(). The regular expression matches an outermost DIV element, which may contain nested DIV elements, and the callback function calls itself recursively whenever the contents of a DIV element contain a nested DIV element. The callback strips the start and end tags of only those DIV elements whose class attribute includes removethis; DIV tags without the removethis class are preserved. (The removethis value is stored in a variable at the top of the following working script, so it is easy to change.) I think you will find that this does a pretty good job:

<?php // test.php Rev:20111219_1600
// Remove DIV start and end tags having this class attribute:
$class_to_remove = "removethis";
// Recursive regex matches an outermost DIV element and its contents.
$re = '% # Match outermost DIV element.
    <                     # Start of HTML start tag
    (                     # $1: DIV element start tag.
      div                 # Tag name = DIV
      (                   # $2: DIV start tag attributes.
        (?:               # Group for zero or more attributes.
          \s+             # Required whitespace precedes attrib.
          [\w.\-:]+       # Attribute name.
          (?:             # Group for optional attribute value.
            \s*=\s*       # Name and value separated by =
            (?:           # Group for value alternatives.
              \'[^\']*\'  # Either single quoted,
            | "[^"]*"     # or double quoted,
            | [\w.\-:]+   # or unquoted value.
            )             # End group of value alternatives.
          )?              # Attribute value is optional.
        )*                # Zero or more attributes.
      )                   # End $2: DIV start tag attributes.
      \s*                 # Optional whitespace before closing >.
      >                   # End DIV element start tag.
    )                     # End $1: DIV element start tag.
    (                     # $3: DIV element contents.
      (?:                 # Group for zero or more content alts.
        (?R)              # Either a nested DIV element.
      |                   # or non-DIV tag stuff.
        [^<]*             # {normal*} Non-< start of tag stuff.
        (?:               # Begin "unrolling-the-loop".
          <               # {special} A "<", but only if it is
          (?!/?div\b)     # NOT start of a <div or </div
          [^<]*           # more {normal*} Non-< start of tag.
        )*                # End {(special normal*)*} construct.
      )*                  # Zero or more content alternatives.
    )                     # End $3: DIV element contents.
    </div\s*>             # DIV element end tag.
    %xi';

// Remove matching start and end tags of DIV elements having specific class.
function stripSpecialDivTags($text) {
    global $re;
    $text = preg_replace_callback($re,
            '_stripSpecialDivTags_cb', $text);
    $text = str_replace("<\0", '<', $text);
    return $text;
}
function _stripSpecialDivTags_cb($matches) {
    global $re, $class_to_remove;
    if (preg_match($re, $matches[3])) {
        $matches[3] = preg_replace_callback($re,
            '_stripSpecialDivTags_cb', $matches[3]);
    }
    // Regex to match class attribute and capture value in $1.
    $re_class = '/ ^      # Anchor to start of attributes string.
        (?:               # Zero or more non-class attributes.
          \s+             # Required whitespace precedes attrib.
          (?!class\b)     # Match any attribute other than "CLASS".
          [\w.\-:]+       # Attribute name.
          (?:             # Group for optional attribute value.
            \s*=\s*       # Name and value separated by =.
            (?:           # Group for value alternatives.
              \'[^\']*\'  # Either single quoted,
            | "[^"]*"     # or double quoted,
            | [\w.\-:]+   # or unquoted value.
            )             # End group of value alternatives.
          )?              # Attribute value is optional.
        )*                # Zero or more non-class attributes.
        \s+               # Required whitespace precedes attrib.
        class\s*=\s*      # "CLASS" is the attribute we need.
        (?|               # Use branch reset to capture value in $1.
          \'([^\']*)\'    # Either $1.1: a single quoted,
        | "([^"]*)"       # or $1.2: a double quoted,
        | ([\w.\-:]+)     # or $1.3: an un-quoted value.
        )                 # End branch reset to capture value in $1.
        /ix';
    $re_remove = '%(?<=^|\s)'.preg_quote($class_to_remove, '%').'(?=\s|$)%';
    if (preg_match($re_class, $matches[2], $m)) {// If DIV has a CLASS,
        if (preg_match($re_remove, $m[1])) { // AND it has special value,
            return $matches[3];     // Then strip start and end DIV tags.
        }
    }
    // Hide the start and end tags by inserting a temporary null char.
    return "<\0". $matches[1] . $matches[3] . "<\0/div>";
}
$data = file_get_contents('testdata.html');
$output = stripSpecialDivTags($data);
file_put_contents('testdata_out.html', $output);
?>
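
The (?R) construct in $re is what lets a single pattern match nested elements: it re-enters the whole pattern at that point. As a smaller illustration of the same idea, here is a sketch that matches balanced parentheses instead of DIV elements (the $balanced pattern and the sample string are made up for this example and are not part of the script above):

```php
<?php
// (?R) recurses into the entire pattern, so each "(" must be closed by
// its own ")" before the outer match can finish.
$balanced = '/\((?:[^()]++|(?R))*\)/';

// The trailing ")" is unbalanced and is left out of the match.
preg_match($balanced, 'f(a(b)c) d)', $m);
echo $m[0], "\n"; // (a(b)c)
```

The DIV regex follows the same shape: "normal" content that cannot open or close an element, alternated with a recursive call for a nested element.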

Example Input:

<div class="do not remove">
    <div class=removethis>
        <div>
            <div class='do removethis one too'>
                <div class="dontremovethisone">
                </div>
            </div>
        </div>
    </div>
</div>

Example Output:

<div class="do not remove">

        <div>

                <div class="dontremovethisone">
                </div>

        </div>

</div>

The complexity of the regex is required to properly handle tag attributes having values that may contain <> angle brackets.
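
To see why the attribute handling matters, compare a naive start-tag pattern with one that understands quoted values. This is a condensed sketch of the attribute part of $re; the $naive and $careful names and the sample tag are only for illustration:

```php
<?php
$tag = '<div title="a > b" class="removethis">';

// Naive: stops at the first ">", even when it sits inside a quoted value.
$naive = '/<div[^>]*>/';

// Careful: consumes whole attributes, so a ">" inside quotes is skipped.
$careful = '/<div(?:\s+[\w.\-:]+(?:\s*=\s*(?:\'[^\']*\'|"[^"]*"|[\w.\-:]+))?)*\s*>/';

preg_match($naive, $tag, $m1);
preg_match($careful, $tag, $m2);

echo $m1[0], "\n"; // <div title="a >
echo $m2[0], "\n"; // <div title="a > b" class="removethis">
```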

Upvotes: 2

maček

Reputation: 77826

Do not parse HTML with regex. You should be using strip_tags() instead:

$html = '<div class="foo">Hello world. <b>I am bold!</b></div>';

$allowed_tags = "<b>";

$text = strip_tags($html, $allowed_tags);

echo $text; #=> Hello world. <b>I am bold!</b>
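
One thing worth checking before relying on this: strip_tags() filters by tag name only, so it cannot tell a div with one class from a div with another. A small sketch with made-up sample markup:

```php
<?php
$html = '<div class="removethis">Hello <div class="keep">world</div></div>';

// Allowing <div> keeps every div, including the one meant to be removed.
$keptDivs = strip_tags($html, '<div>');

// Allowing nothing strips every div, including the one meant to be kept.
$noTags = strip_tags($html);

echo $keptDivs, "\n"; // same markup as $html
echo $noTags, "\n";   // Hello world
```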

Upvotes: 0

mario

Reputation: 145512

The ^ is likely wrong there. Without the m modifier it matches only at the start of the subject, not at the start of each line, and that can never occur at this position.

And you are replacing the match with '' (nothing) instead of the captured contents, which in your pattern is the inner capture group, '$2'.
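
Putting those two points together, a minimal sketch of the corrected call might look like this. It assumes the class attribute is written exactly as class="removethis" and that the target div contains no nested div; with nesting, the lazy .*? would stop at the first </div> it finds. The sample $content is made up:

```php
<?php
$content = '<p>a</p><div class="removethis">keep <b>me</b></div><div class="other">x</div>';

// Capture only the contents and put them back in place of the whole element.
$content = preg_replace(
    '#<div class="removethis">(.*?)</div>#is',
    '$1',
    $content
);

echo $content, "\n"; // <p>a</p>keep <b>me</b><div class="other">x</div>
```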

And the off-topic answer for repwhoring: You could alternatively use querypath or another library for managing html content. Then the replacement gets simpler:

  htmlqp($html)->remove("div.removethis")->...()->writeHTML();

Often inappropriate for output transformation. But easier and more useful in other cases.
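
If you would rather not add a dependency, PHP's bundled DOM extension can do the same kind of unwrapping. The following is only a sketch: unwrapDivsByClass() and the unwrap-root wrapper id are hypothetical names, the input is assumed to be an HTML fragment in ASCII (or with entities for anything else), and $class is assumed to be a plain class name without quotes.

```php
<?php
// Hypothetical helper: unwrap every <div> whose class list contains $class,
// keeping its children in place.
function unwrapDivsByClass(string $html, string $class): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings quiet
    // Wrap the fragment in a known container so it can be extracted again.
    $doc->loadHTML('<div id="unwrap-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Match $class as a whole token of the space-separated class attribute.
    $query = '//div[@id="unwrap-root"]//div[contains('
           . 'concat(" ", normalize-space(@class), " "), " ' . $class . ' ")]';

    foreach (iterator_to_array($xpath->query($query)) as $div) {
        // Move each child in front of the div, then drop the empty div.
        while ($div->firstChild !== null) {
            $div->parentNode->insertBefore($div->firstChild, $div);
        }
        $div->parentNode->removeChild($div);
    }

    // Serialize only what is inside the wrapper.
    $out = '';
    $root = $xpath->query('//div[@id="unwrap-root"]')->item(0);
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo unwrapDivsByClass(
    '<div class="keep"><div class="a removethis"><p>Hi</p></div></div>',
    'removethis'
), "\n";
```

Because the parser builds a real tree, nesting and attribute quoting are handled for you; the trade-off is that the serializer may normalize whitespace or markup slightly compared with the input.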