Reputation: 2215

PHP get html comments in string and wrap in <pre> tag. Regex or DOM?

I would like to find comment tags in a string that are NOT already inside a <pre> tag, and wrap them in a <pre> tag.

It seems like there's no way of 'finding' comments using the PHP DOM.

I'm already using regex for some of the processing, but I have yet to really understand lookaheads and lookbehinds.

For instance, I may have the following code:

<!-- Comment 1 -->

<pre>
    <div class="some_html"></div>
    <!-- Comment 2 -->
</pre>

I would like to wrap Comment 1 in <pre> tags, but obviously not Comment 2 as it already resides in a <pre>.

How would this usually be done in RegEx?

Here's roughly what I've understood about negative lookarounds, along with my attempt at one. I'm clearly doing something very wrong!

(?<!<pre>.*?)(?!.*?</pre>)

Upvotes: 2

Answers (4)

user257319

Reputation:

It's quite easy, using a principle called the stack counter.
Essentially, you count the number of <pre> tags and the number of </pre> tags that occur before the position of your match in the HTML.
If there are more <pre> than </pre>, the match is inside a block: "<pre>..--you are here--..</pre>".
In that case, simply return the match unmodified; otherwise, wrap it.
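
The counting idea above could be sketched in PHP roughly as follows. This is only an illustration under a few assumptions: wrapCommentsOutsidePre is a hypothetical helper name, the PREG_OFFSET_CAPTURE flag of preg_replace_callback needs PHP 7.4 or later, and a <pre> written inside a comment or attribute would be miscounted.

```php
<?php
// Sketch of the stack-counter idea; wrapCommentsOutsidePre is a hypothetical helper name.
function wrapCommentsOutsidePre(string $html): string
{
    $result = preg_replace_callback(
        '/<!--.*?-->/s',
        function (array $m) use ($html): string {
            [$comment, $offset] = $m[0];          // matched text and its byte offset
            $before = substr($html, 0, $offset);  // everything preceding the comment
            $opened = preg_match_all('/<pre\b[^>]*>/i', $before);
            $closed = preg_match_all('/<\/pre\s*>/i', $before);
            // More openings than closings: the comment sits inside a <pre>, leave it alone.
            return $opened > $closed ? $comment : '<pre>' . $comment . '</pre>';
        },
        $html,
        -1,
        $count,
        PREG_OFFSET_CAPTURE
    );
    return $result ?? $html; // preg_replace_callback returns null on error
}

$sample = "<!-- Comment 1 -->\n<pre>\n  <div class=\"some_html\"></div>\n  <!-- Comment 2 -->\n</pre>";
echo wrapCommentsOutsidePre($sample);
```

The offsets passed to the callback refer to the original string, so earlier replacements do not shift the counts.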

Upvotes: 0

pguardiario

Reputation: 54984

Xpath is your friend:

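$doc = new DOMDocument();
$doc->loadHTML($html); // assumption: $html holds the markup to process
$xpath = new DOMXPath($doc);

// Select every comment node that has no <pre> ancestor
foreach ($xpath->query('//comment()[not(ancestor::pre)]') as $comment) {
  $pre = $doc->createElement("pre");
  $comment->parentNode->insertBefore($pre, $comment); // place an empty <pre> before the comment
  $pre->appendChild($comment);                        // then move the comment into it
}

echo $doc->saveHTML();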

Upvotes: 0

Enissay

Reputation: 4953

It seems like there's no way of 'finding' comments using the PHP DOM.

Of course you can... Check this code using PHP Simple HTML DOM Parser:

<?php
// PHP Simple HTML DOM Parser must be loaded first; adjust the path to your copy
require 'simple_html_dom.php';

$text = '<!-- Comment 1 -->

        <pre>
            <div class="some_html"></div>
            <!-- Comment 2 -->
        </pre>';

echo "<div>Original Text: <xmp>$text</xmp></div>";

$html = str_get_html($text);

// find() returns an array of all comment nodes
$comments = $html->find('comment');

if ($comments) {
  echo '<br>Find function found ' . count($comments) . ' results: ';

  foreach ($comments as $key => $com) {
    echo '<br>' . $key . ': ' . $com->tag . ' which contains = <xmp>' . $com->innertext . '</xmp>';
  }
} else {
  echo 'find() failed!';
}
?>

$com->innertext will give you the comments like ...

Now you just have to clean them up as you wish, for example using ...

Edit:

A note concerning lookbehind: it MUST have a fixed width, so you cannot use repetition (* or +) or optional items (?) inside it.

The bad news is that most regex flavors do not allow you to use just any regex inside a lookbehind, because they cannot apply a regular expression backwards. Therefore, the regular expression engine needs to be able to figure out how many steps to step back before checking the lookbehind.

Therefore, many regex flavors, including those used by Perl and Python, only allow fixed-length strings. You can use any regex of which the length of the match can be predetermined. This means you can use literal text and character classes. You cannot use repetition or optional items. You can use alternation, but only if all options in the alternation have the same length.

Source: http://www.regular-expressions.info/lookaround.html
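
To see the restriction in PHP, a small check like the one below may help. It is a sketch, not a recommended technique: the first pattern reuses the unbounded lookbehind from the question, and the @ operator is only there to hide the warning that preg_match raises when the pattern fails to compile.

```php
<?php
// The pattern below reuses the unbounded lookbehind from the question.
$pattern = '/(?<!<pre>.*?)<!--/s';

// PCRE refuses to compile it, so preg_match() returns false instead of 0 or 1.
$result = @preg_match($pattern, '<pre><!-- x --></pre>');
var_dump($result === false);

// A fixed-width lookbehind compiles, but it only inspects the characters
// immediately before the match, which cannot detect a <pre> further back.
$fixed = preg_match('/(?<!<pre>)<!--/', '<pre><!-- x --></pre>');
var_dump($fixed);
```

This is why a lookbehind alone is not enough to decide whether a comment sits inside a <pre> block.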

Upvotes: 1

Jens

Reputation: 25563

You should really use a DOM parser if you are planning on re-using this code. Every regex approach will fail horribly sooner rather than later when presented with real-world HTML.

Having said that, here's what you could (but should not, see above) do:

First, identify comments, e.g. using

<!-- (?:(?!-->).)*-->

The negative look-ahead block ensures that the repeated . does not run past the end of the comment block.

Now, you need to figure out whether this comment is inside a <pre> block. The key observation here is that every comment NOT already inside one is followed by an even number of <pre> and </pre> tags (counted together).
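
As a quick sanity check of this first step, the comment pattern can be tried on its own in PHP. This is a minimal sketch; the ~ delimiter and the s modifier are choices made here, not part of the pattern above.

```php
<?php
$html = "<!-- Comment 1 -->\n<pre>\n  <!-- Comment 2 -->\n</pre>";

// "s" lets "." match newlines so multi-line comments are found too;
// "~" is used as the delimiter so the pattern needs no extra escaping.
preg_match_all('~<!-- (?:(?!-->).)*-->~s', $html, $matches);
print_r($matches[0]);
```

At this stage both comments are returned, since nothing yet distinguishes the one inside the <pre>.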

So, run through the rest of your text, always in pairs of <pre>s, and check if you arrive at the end.

This would look like

(?=(?:(?!</?pre>).)*(?:</?pre>(?:(?!</?pre>).)*</?pre>(?:(?!</?pre>).)*)*$)

So, together this would be

<!-- (?:(?!-->).)*-->(?=(?:(?!</?pre>).)*(?:</?pre>(?:(?!</?pre>).)*</?pre>(?:(?!</?pre>).)*)*$)

A hurray for write-only code =)

The prominent building block of this expression is (?:(?!</?pre>).) which matches every character that is not the starting bracket of a <pre> or </pre> sequence.

Allowing attributes on the <pre> and proper escaping are left as an exercise for the reader. See this in action at RegExr.
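
For completeness, here is how the combined expression might be applied from PHP. It is a sketch under the same limitations as above (no attributes on <pre>, a space required after <!--); the variable names are made up for illustration, and the s modifier is needed so that . also crosses line breaks.

```php
<?php
$html = "<!-- Comment 1 -->\n\n<pre>\n    <div class=\"some_html\"></div>\n    <!-- Comment 2 -->\n</pre>";

$notPre  = '(?:(?!</?pre>).)*';      // any run of text containing no <pre> or </pre>
$comment = '<!-- (?:(?!-->).)*-->';  // a single comment

// Comment followed by zero or more *pairs* of pre tags up to the end of the input.
$pattern = "~{$comment}(?={$notPre}(?:</?pre>{$notPre}</?pre>{$notPre})*\$)~s";

$wrapped = preg_replace($pattern, '<pre>$0</pre>', $html);
echo $wrapped;
```

Only Comment 1 is wrapped here, because Comment 2 is followed by a single, unpaired </pre>.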

Upvotes: 2
