Tal Galili

Reputation: 25326

PHP Regex to remove HTML-Tags inside <pre></pre> code blocks

I have a tricky string of HTML that includes several pre tags. Each one contains code (say, Python) that is also decorated with HTML tags, and those tags should be removed.

For example:

Some text.
<pre>
a = 5 <br/>
b = 3
</pre>
More text
<pre>
a2 = "<a href='something'>text</a>"
b = 3
</pre>
final text

I would like to remove all the HTML tags inside the pre blocks (these are likely to be basic tags: br, em, div, a, etc.). I do not need to parse the HTML; I know that regex cannot parse HTML. The desired output is:

Some text.
<pre>
a = 5
b = 3
</pre>
More text
<pre>
a2 = "text"
b = 3
</pre>
final text

I'd like to do this using PHP (with something like preg_replace). Here is my attempt:

$html = "<html><head></head><body><div><pre class=\"some-css-class\">
         <p><strong>
         some_code = 1
         </p></strong>
         </pre></div></body>"; // Compacting things here, for brevity

$newHTML = preg_replace("/(.*?)<pre[^<>]*>(.*?)<\/pre>(.*)/Us", "$1".strip_tags("$2", '<p><a><strong>')."$3", $html);
echo $newHTML;

This example code obviously doesn't work, since: (1) it would handle only one pre tag, and (2) the call strip_tags("$2", '<p><a><strong>') runs before preg_replace does, so it processes the literal string "$2" rather than the captured text and simply returns "$2" unchanged.
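Point (2) can be checked in isolation. The following is a minimal sketch, assuming only standard PHP string rules: a variable name cannot start with a digit, so "$1", "$2" and "$3" are not interpolated, and strip_tags receives the two-character string "$2", which contains no tags. The variable name $replacement is introduced here only for illustration.

```php
<?php
// The replacement argument is fully evaluated before preg_replace() is called,
// so strip_tags() sees the literal string "$2", not the captured group.
$replacement = "$1" . strip_tags("$2", '<p><a><strong>') . "$3";

// The result is just the three backreferences glued together.
var_dump($replacement === '$1$2$3'); // bool(true)
```

In other words, the replacement string that reaches preg_replace carries no trace of the strip_tags call.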

Any suggestions on how this could be done in PHP? Thanks.

Upvotes: 1

Views: 529

Answers (1)

anubhava

Reputation: 785196

You will need to use preg_replace_callback and call strip_tags in the callback body (here $s holds the input HTML):

echo preg_replace_callback('~(<pre[^>]*>)([\s\S]*?)(</pre>)~',
    function ($m) { return $m[1] . strip_tags($m[2], ['p', 'b', 'strong']) . $m[3]; },
    $s);

Output for the first example in the question:

Some text.
<pre>
a = 5
b = 3
</pre>
More text
<pre>
a2 = "text"
b = 3
</pre>
final text

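Note that the strip_tags call above strips all tags except p, b and strong. Passing the allowed tags as an array requires PHP 7.4 or later; on older versions use the string form '<p><b><strong>'.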

RegEx Details:

  • (<pre[^>]*>): Match <pre...> and capture in group #1
  • ([\s\S]*?): Match 0 or more of any character, including newlines (lazy), and capture in group #2
  • (</pre>): Match </pre> and capture in group #3
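Putting the three groups together, here is a minimal self-contained sketch run against the first example from the question. The function name stripTagsInsidePre and its $allowed parameter are hypothetical, not part of the answer above. The sketch also makes a few assumptions beyond it: pre tags may be written in upper case (the i flag), \b keeps the pattern from matching other tag names that merely start with "pre", the s flag with .*? is used as an equivalent of [\s\S]*?, and no tags are kept by default.

```php
<?php
// Hypothetical helper: strips HTML tags only inside <pre>...</pre> blocks.
// $allowed uses the string form (e.g. '<p><b>'), which works on all PHP versions.
function stripTagsInsidePre(string $html, string $allowed = ''): string
{
    $result = preg_replace_callback(
        '~(<pre\b[^>]*>)(.*?)(</pre>)~is',
        function (array $m) use ($allowed): string {
            // $m[1] = opening tag, $m[2] = block content, $m[3] = closing tag
            return $m[1] . strip_tags($m[2], $allowed) . $m[3];
        },
        $html
    );

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}

$html = "Some text.\n<pre>\na = 5 <br/>\nb = 3\n</pre>\nMore text\n<pre>\n"
      . "a2 = \"<a href='something'>text</a>\"\nb = 3\n</pre>\nfinal text";

echo stripTagsInsidePre($html);
```

Because the callback runs once per match, every pre block in the string is processed, and text outside the blocks is left untouched.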

Upvotes: 3
