Tim

Reputation: 7056

regex help with replacing <html> tags

I need to extend the regex below so that it also matches <code> tags with a class, e.g. <code class="lol">

var text = 'This is <i>encoded text</i> but this is <b>bold</b >!';
var html = $('<div/>')
    .text(text)
    .html()
    .replace(new RegExp('&lt;(/)?(b|i|u)\\s*&gt;', 'gi'), '<$1$2>');

Can anyone please help?

I'm guessing something like &lt;(/)?(b|i|u|code|pre)?( class="")\\s*&gt; ??

Many thanks

Upvotes: 1

Views: 2666

Answers (3)

Jaroslav Jandek

Reputation: 9563

This will replace the whole tag with everything in it (including class, id, etc.):

.replace(new RegExp('&lt;(/)?(b|u|i|code|pre)(.*?)&gt;', 'gim'), '<$1$2$3>');

Matching a code tag with a class in an encoded string is hard (maybe impossible) in the general case, but it's easy when the code tag is in a fixed format (<code class="whatever">):

.replace(new RegExp('&lt;(?:(code\\sclass=".*?")|(/)?(b|u|i|code|pre)(?:.*?))&gt;', 'gim'), '<$1$2$3>');
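A minimal sketch of the fixed-format idea applied to the question's whitelist, assuming the class attribute is always written as class="..." with double quotes and a simple value. The helper names escapeHtml and restoreAllowedTags are hypothetical; escapeHtml only stands in for $('<div/>').text(text).html() so the snippet can run without jQuery or a DOM.

```javascript
// Hypothetical stand-in for $('<div/>').text(text).html(): escapes &, < and >.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Whitelisted tags; an opening tag may carry one class="..." attribute.
// Group 1: optional "/", group 2: tag name, group 3: optional class attribute.
var allowedTag = /&lt;(\/)?(b|i|u|code|pre)(\s+class="[\w\s-]*")?\s*&gt;/gi;

function restoreAllowedTags(text) {
  return escapeHtml(text).replace(allowedTag, function (match, slash, name, cls) {
    // Closing tags never keep attributes; opening tags keep the class if present.
    return '<' + (slash || '') + name + (slash ? '' : (cls || '')) + '>';
  });
}

var sample = 'This is <i>encoded</i>, <code class="lol">x < 1</code> and <script>bad</script >';
console.log(restoreAllowedTags(sample));
```

Anything that does not match the whitelist exactly (other tags, other attributes, a stray "<") stays encoded, which is probably the safer default here.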

Upvotes: 1

user557597

I wouldn't use a regex for parsing markup, but if it's just a string snippet, something like this would be sufficient. Note that the regex you're using is overburdened by the \s*: because the whitespace is optional, it also matches tags that have no whitespace to remove, so the engine goes through the overhead of replacing them with the exact same text. Better to use \s+.

regex: <(/?(?:b|i|u)|code\s[^>]*class\s*=\s*(['"]).*?\2[^>]*?)\s+>
replace: <$1>
modifiers: sgi

<                       # < Opening markup char
   (                       # Capture group 1
       /?                        # optional element termination
       (?:                       # grouping, non-capture
          b|i|u                    # elements 'b', 'i', or 'u'
       )                         # end grouping
    |                         # OR,
       code                      # element 'code' only
       \s [^>]*                  # followed by a space and possibly any chars except '>'
       class \s* = \s*           # 'class' attribute '=' something
         (['"]) .*? \2           # value delimiter, then some possible chars, then the same delimiter
       [^>]*?                    # followed by possibly any chars not '>'
   )                       # End capture group 1
   \s+                     # requires 1 or more whitespace chars; this is what gets removed
>                      # > Closing markup char
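
To illustrate the \s+ point, here is a small sketch in JavaScript that applies the compact pattern (using the [^>]* form from the breakdown above) to raw, unencoded markup. The sample string and variable names are only illustrative assumptions, and the s flag needs an ES2018-capable engine.

```javascript
// Compact form of the pattern above.
// Flags: g (all matches), i (ignore case), s (dot also matches newlines).
// Group 1 is the tag content to keep; group 2 is the quote character around the class value.
var trailingSpace = /<(\/?(?:b|i|u)|code\s[^>]*class\s*=\s*(['"]).*?\2[^>]*?)\s+>/gis;

var input = 'A <b >bold</b > word, <i>italic</i>, and <code class="lol"  >code</code>.';

// Only tags that actually end in whitespace are matched and rewritten;
// <i>, </i> and </code> are left alone because \s+ cannot match there.
var output = input.replace(trailingSpace, '<$1>');

console.log(output);
```

With \s* in place of \s+, the already-clean <i> and </i> tags would match as well and be replaced by identical text, which is the wasted work described above.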

Upvotes: 0

Mark Coleman

Reputation: 40863

Parsing html with a regex is a bad idea, see this answer.

The easiest way would be to simply use some of jQuery's DOM manipulation functions to remove the formatting.

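var $div = $('<div/>').html(text); // parse the markup from the question into a detached element
$div.find("b, i, code, code.lol").each(function() {
    $(this).replaceWith($(this).text()); // swap each matched tag for its plain text
});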

Code example on jsfiddle.

Upvotes: 3
