mingos

Reputation: 24502

Regex to match text inside HTML tags

I'm trying to write a regex that will remove HTML tags around a placeholder text, so that this:

<p>
    Blah</p>
<p>
    {{{body}}}</p>
<p>
    Blah</p>

Becomes this:

<p>
    Blah</p>
{{{body}}}
<p>
    Blah</p>

My current regex is /<.+>.*\{\{\{body\}\}\}<\/.+>/msU. However, it will also remove the contents of the tag preceding the placeholder, resulting in:

{{{body}}}
<p>
    Blah</p>
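
For reference, here is a minimal reproduction of the problem. It is only a sketch: it assumes the replacement is done with preg_replace and that the placeholder itself is used as the replacement string, neither of which is shown above. The variable names are illustrative.

```php
<?php
// Sample input from the question, with the newlines and indentation kept.
$input = "<p>\n    Blah</p>\n<p>\n    {{{body}}}</p>\n<p>\n    Blah</p>";

// The regex from the question. With /s the dot also matches newlines, so
// after "<.+>" matches the first <p>, ".*" can extend across the first
// paragraph until it reaches the placeholder.
$pattern = '/<.+>.*\{\{\{body\}\}\}<\/.+>/msU';

// Assumed replacement: put the bare placeholder back in place of the match.
$result = preg_replace($pattern, '{{{body}}}', $input);

echo $result;
```

The match starts at the very first <p>, which is why the first paragraph disappears along with the wrapper.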

I can't assume the users will always place the placeholder inside <p>, so I would like it to remove any pair of tags immediately around the placeholder. I would appreciate some help with correcting my regex.

[EDIT]

I think it's important to note that the input may or may not be processed by CKEditor. It adds newlines and tabs to the opening tags, thus the regex needs to go with the /sm (dotall + multiline) modifiers.

Upvotes: 1

Views: 4626

Answers (2)

Nesim Razon

Reputation: 9794

Doesn't PHP's strip_tags work for your case?

http://php.net/manual/en/function.strip-tags.php

<?php
$text = '<p>Test paragraph.</p><!-- Comment --> <a href="#fragment">Other text</a>';
echo strip_tags($text);
echo "\n";

// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>

Upvotes: 1

Joseph Silber

Reputation: 219910

Try this:

<[^>]+>\s*\{{3}body\}{3}\s*<\/[^>]+>

See it here in action: http://regexr.com?30s4o

Here's the breakdown:

  • <[^>]+> matches a single HTML tag: <, then one or more characters other than >, then >. Unlike <.+>, it cannot run past the end of the tag.
  • \s* matches any run of whitespace, including none (roughly equivalent to [ \t\r\n\f]*)
  • \{{3} matches a { exactly 3 times
  • body matches the string literally
  • \}{3} matches a } exactly 3 times
  • \s* again matches any whitespace
  • <\/[^>]+> matches a closing HTML tag
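
In PHP, the pattern above could be applied along these lines. This is a minimal sketch, not a definitive implementation: the function name unwrapBodyPlaceholder is hypothetical, and using preg_replace with the bare placeholder as the replacement is an assumption about how the question's code works.

```php
<?php
// Hypothetical helper: strips the pair of tags immediately around {{{body}}}.
function unwrapBodyPlaceholder(string $html): string
{
    // Same pattern as above, with # as the delimiter so the / in the
    // closing tag needs no escaping. No /s or /m modifier is required,
    // because [^>] and \s already match newlines.
    $pattern = '#<[^>]+>\s*\{{3}body\}{3}\s*</[^>]+>#';

    // preg_replace returns null on error; fall back to the original input.
    return preg_replace($pattern, '{{{body}}}', $html) ?? $html;
}

// Sample input from the question.
$input = "<p>\n    Blah</p>\n<p>\n    {{{body}}}</p>\n<p>\n    Blah</p>";
echo unwrapBodyPlaceholder($input);
```

With the question's sample input, only the <p> and </p> around the placeholder are removed, and both "Blah" paragraphs are left as they were.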

Upvotes: 5
