Reputation: 25326
I have a tricky string of HTML that contains several pre tags. Each pre block holds code (say, Python) that is also decorated with HTML tags, and those tags should be removed.
For example:
Some text.
<pre>
a = 5 <br/>
b = 3
</pre>
More text
<pre>
a2 = "<a href='something'>text</a>"
b = 3
</pre>
final text
I would like to strip all the HTML tags inside the pre blocks (these are likely to be basic tags: br, em, div, a, etc.). I do not need to parse the HTML; I know that regex cannot parse HTML. The desired result is:
Some text.
<pre>
a = 5
b = 3
</pre>
More text
<pre>
a2 = "text"
b = 3
</pre>
final text
I'd like to do this using PHP (with something like preg_replace). For example:
$html = "<html><head></head><body><div><pre class=\"some-css-class\">
<p><strong>
some_code = 1
</p></strong>
</pre></div></body>"; // Compacting things here, for brevity
$newHTML = preg_replace("/(.*?)<pre[^<>]*>(.*?)<\/pre>(.*)/Us", "$1".strip_tags("$2", '<p><a><strong>')."$3", $html);
echo $newHTML;
This example code obviously doesn't work, since: (1) it would handle only one pre tag, and (2) the call strip_tags("$2", '<p><a><strong>') does not process the string at the right point: it is evaluated before preg_replace runs, so it just returns "$2" instead of receiving the captured text and manipulating it.
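Point (2) can be checked in isolation. The snippet below is only a minimal sketch; the variable name $replacementPart is made up for illustration. It shows that strip_tags is handed the literal characters $2, not the text captured by the regex:

```php
<?php
// PHP evaluates function arguments before preg_replace() is called,
// so strip_tags() only ever sees the two-character string "$2".
// ("$2" is not interpolated: a variable name cannot start with a digit.)
$replacementPart = strip_tags("$2", '<p><a><strong>');

var_dump($replacementPart === '$2'); // bool(true)
```

The replacement string that reaches preg_replace is therefore just "$1$2$3", with no tag stripping applied to the captured group.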
Any suggestions on how this could be done in PHP? Thanks.
Upvotes: 1
Views: 529
Reputation: 785196
You will need to use preg_replace_callback and call strip_tags in the callback body:
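// $s holds the HTML string from the question.
// Passing the allowed tags as an array requires PHP 7.4+.
echo preg_replace_callback('~(<pre[^>]*>)([\s\S]*?)(</pre>)~',
    function ($m) { return $m[1] . strip_tags($m[2], ['p', 'b', 'strong']) . $m[3]; },
    $s);
Output: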
Some text.
<pre>
a = 5
b = 3
</pre>
More text
<pre>
a2 = "text"
b = 3
</pre>
final text
Note that the strip_tags call above strips all tags except p, b and strong.
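To illustrate how the allowed-tags argument behaves, here is a small sketch on a made-up fragment (the $fragment value and variable names are assumptions, not part of the question). The array form needs PHP 7.4 or later; the string form works on older versions too:

```php
<?php
// Hypothetical input, chosen to mix allowed and disallowed tags.
$fragment = '<p>x = <strong>1</strong> <em>y</em></p>';

// Array form (PHP 7.4+): keep p, b and strong, drop everything else.
$kept = strip_tags($fragment, ['p', 'b', 'strong']);

// Equivalent string form of the same allow-list.
$keptStringForm = strip_tags($fragment, '<p><b><strong>');

// No second argument: every tag is removed.
$none = strip_tags($fragment);

echo $kept, "\n";  // <p>x = <strong>1</strong> y</p>
echo $none, "\n";  // x = 1 y
```

To remove every tag inside the pre blocks, as the question asks, the second argument can simply be left out.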
RegEx Details:
(<pre[^>]*>): Match <pre...> and capture in group #1
([\s\S]*?): Match 0 or more of any character including newline (lazy), and capture this in group #2. [\s\S] matches any character including newline.
(</pre>): Match </pre> and capture in group #3
Upvotes: 3
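Putting the pieces together, the following is a self-contained sketch that applies the same pattern to the sample input from the question. The helper name stripTagsInsidePre and its $allowed parameter are hypothetical, introduced only for this example; the array form of the allow-list assumes PHP 7.4 or later.

```php
<?php
// Hypothetical helper wrapping the preg_replace_callback call shown above.
// $allowed lists the tag names that strip_tags should keep inside <pre>.
function stripTagsInsidePre(string $html, array $allowed = ['p', 'b', 'strong']): string
{
    $result = preg_replace_callback(
        '~(<pre[^>]*>)([\s\S]*?)(</pre>)~',
        function (array $m) use ($allowed): string {
            // $m[1] = opening tag, $m[2] = body, $m[3] = closing tag
            return $m[1] . strip_tags($m[2], $allowed) . $m[3];
        },
        $html
    );

    // preg_replace_callback returns null on a regex error;
    // fall back to the untouched input in that case.
    return $result ?? $html;
}

// Sample input taken from the question.
$html = "Some text.\n<pre>\na = 5 <br/>\nb = 3\n</pre>\nMore text\n"
      . "<pre>\na2 = \"<a href='something'>text</a>\"\nb = 3\n</pre>\nfinal text";

echo stripTagsInsidePre($html);
```

Because the callback runs once per match, every pre block is processed, and text outside the pre blocks is left alone.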