Amyth

Reputation: 32949

JavaScript regex to strip selected HTML tags

I am trying to strip all HTML tags (except a few) from a string using a regex. What I am currently trying is as follows:

var a = "<pre><code><p>This is a <span>test</span></p></code></pre>";
var b = a.replace(/(\<|\<\/)[^\>,p,li,br,b]*\>/ig,"");

but b's value is returned as "<pre><p>This is a <span>test</span></p></pre>"

It seems that any tag containing even a single one of the characters in [>,p,li,br,b] is not removed; here pre, span and p all contain the character p. However, I only want to keep the exact tag names listed in [^\>,p,li,br,b].

The output I am expecting is "<p>This is a test</p>".

What am I doing wrong?

Upvotes: 0

Views: 569

Answers (2)

Nicholas Daley-Okoye

Reputation: 2397

var a = "<pre><code><p>This is a <span>test</span></p></code></pre>";
var b = a.replace(/\<(?!\/?(p|li|br|b)[ >])[^>]*\>/ig,"");

This regex matches the leading < only if it is not followed by an optional / and one of the tag names you want to keep (p, li, br, b). The tag name must itself be followed by a space or the closing >, so that <pre> is not mistaken for <p>.

Then it matches everything up to the closing >.
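
As a quick check, here is a minimal sketch that runs the regex above against the sample string. The helper names stripTags and stripTagsBoundary and the extra <br/> input are illustrative assumptions, not part of the answer. The sketch also shows one limitation: a self-closing <br/> written without a space is removed, because / is neither a space nor >. Using a word boundary \b in place of [ >] is one possible way around that.

```javascript
// Hypothetical helper wrapping the regex from the answer above.
function stripTags(html) {
  return html.replace(/<(?!\/?(p|li|br|b)[ >])[^>]*>/ig, "");
}

// Variant (an assumption, not from the answer): \b instead of [ >],
// so the tag name may be followed by any non-word character, including "/".
function stripTagsBoundary(html) {
  return html.replace(/<(?!\/?(?:p|li|br|b)\b)[^>]*>/ig, "");
}

var a = "<pre><code><p>This is a <span>test</span></p></code></pre>";
console.log(stripTags(a));                      // <p>This is a test</p>

var withBreaks = "one<br/>two<br />three";
console.log(stripTags(withBreaks));             // onetwo<br />three
console.log(stripTagsBoundary(withBreaks));     // one<br/>two<br />three
```

Note that the \b variant is looser: it would also keep a tag such as <p-custom>, since - is a non-word character. Whether that matters depends on the input you expect.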

Upvotes: 3

vol7ron

Reputation: 42089

See this answer.

That said, square brackets [] match on single characters, not words - for more information on what yours is doing, see the bottom of this answer. Instead, you would need to use parentheses (?:p|li|br|b) to match words - the ?: is used to avoid capturing. Also, the parentheses would occur outside of the square brackets.

Since you're using a negative match, you may wish to look into lookarounds; specifically negative lookahead, (?!...), which asserts that what follows does not match without consuming any characters.


[^\>,p,li,br,b] translates to not > and not , and not p and not , and not l and not i and not , and not b and not r and not , and not b.
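
To see this in practice, here is a small sketch that tests the question's pattern against individual tags. The tag list and the variable names are illustrative assumptions. Only tags made up entirely of characters outside the set > , p l i b r can match, which is why <code> is stripped while <pre> and <span> survive.

```javascript
// The pattern from the question, unchanged.
var original = /(\<|\<\/)[^\>,p,li,br,b]*\>/ig;

// The same character class with the repeated characters written once.
var simplified = /(\<|\<\/)[^>,plibr]*\>/ig;

var tags = ["<pre>", "<code>", "<p>", "<span>", "<li>", "<em>"];
var removed = tags.filter(function (tag) {
  original.lastIndex = 0; // reset, because the g flag makes test() stateful
  return original.test(tag);
});
console.log(removed); // only <code> and <em> match

// Both patterns behave identically on the sample string.
var a = "<pre><code><p>This is a <span>test</span></p></code></pre>";
console.log(a.replace(original, "") === a.replace(simplified, "")); // true
```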

Upvotes: 1
