rashadb

Reputation: 2543

With JavaScript, how do you remove one tag from an HTML string that contains multiple tags using a regular expression?

I have a string of HTML that I want to deploy without the <img /> tag. What I currently have is:

var myHTML = "<p><img class="alignnone size-full wp-image-2857" 
src="https://files.wordpress.com/2016/05/laptop.jpg?w=750&#038;h=545" 
alt="https://pixabay.com/en/laptop-printer-office-folder-graph-1016257/" 
width="750" height="545" /></p> <p>STUFF</p> <p>MORE STUFF</p> <p>EVEN MORE 
STUFF</p> <p><strong><span style="text-decoration:underline;">OTHER 
STUFF</span></strong></p> <p><em>OTHER STUFF</em>: DEMO STUFF</p> <p>
<em>TEST STUFF</em>: WRAP UP STUFF</p> <p><strong><span style="text-
decoration:underline;">REST OF STUFF</span></strong></p> <p><em>Aloha 
POS</em>: KEEP THIS STUFF TOO</p> <p><em>Revel</em>: WHAT STUFF</p> <p> 
DONE</p> "

What I think it should look like:

var myHTML2 = "<p></p> <p>STUFF</p> <p>MORE STUFF</p> <p>EVEN MORE 
    STUFF</p> <p><strong><span style="text-decoration:underline;">OTHER 
    STUFF</span></strong></p> <p><em>OTHER STUFF</em>: DEMO STUFF</p> <p>
    <em>TEST STUFF</em>: WRAP UP STUFF</p> <p><strong><span style="text-
    decoration:underline;">REST OF STUFF</span></strong></p> <p><em>Aloha 
    POS</em>: KEEP THIS STUFF TOO</p> <p><em>Revel</em>: WHAT STUFF</p> <p> 
    DONE</p> "

What I tried:

myHTML.replace(/<(?!\s*\/?\s*p\b)[^>]*>/gi,'') 

But this strips all of the HTML from the string, and I only want to remove the <img /> tag.

Upvotes: 0

Views: 58

Answers (3)

Ro Yo Mi

Reputation: 14990

Foreword

Using a regex to parse HTML is generally not advisable because of all the obscure edge cases that can crop up, but it seems you have some control over the HTML, so you should be able to avoid many of the edge cases the regex police cry about.

Description

<img\s(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>

Replace with: nothing

This regex will do the following:

  • match the entire img tag, including all of its attributes
  • avoid some of the edge cases that make dealing with HTML difficult, such as a '>' inside a quoted attribute value
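
For completeness, here is a minimal sketch of how the pattern might be applied in JavaScript. The g and i flags, the variable names, and the shortened sample string are my own assumptions for illustration; they are not part of the expression above.

```javascript
// Pattern copied from above; the g (all matches) and i (case-insensitive)
// flags are an assumption added for this sketch.
var imgTag = /<img\s(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>/gi;

// Shortened, hypothetical stand-in for the HTML in the question.
var sample = '<p><img class="alignnone" src="laptop.jpg" width="750" /></p> <p>STUFF</p>';

// replace() returns a new string and leaves the original untouched,
// so the result has to be assigned to something.
var cleaned = sample.replace(imgTag, '');
console.log(cleaned);
```

Note that calling replace() without keeping its return value has no visible effect, because strings in JavaScript are immutable.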

Examples

Live demo https://regex101.com/r/pG1oI7/1

Sample String

<p><img class="alignnone size-full wp-image-2857" 
src="https://files.wordpress.com/2016/05/laptop.jpg?w=750&#038;h=545" 
alt="https://pixabay.com/en/laptop-printer-office-folder-graph-1016257/" 
width="750" height="545" /></p> <p>STUFF</p> <p>MORE STUFF</p> <p>EVEN MORE 
STUFF</p> <p><strong><span style="text-decoration:underline;">OTHER 
STUFF</span></strong></p> <p><em>OTHER STUFF</em>: DEMO STUFF</p> <p>
<em>TEST STUFF</em>: WRAP UP STUFF</p> <p><strong><span style="text-
decoration:underline;">REST OF STUFF</span></strong></p> <p><em>Aloha 
POS</em>: KEEP THIS STUFF TOO</p> <p><em>Revel</em>: WHAT STUFF</p> <p> 
DONE</p> 

After Replacement

<p></p> <p>STUFF</p> <p>MORE STUFF</p> <p>EVEN MORE 
STUFF</p> <p><strong><span style="text-decoration:underline;">OTHER 
STUFF</span></strong></p> <p><em>OTHER STUFF</em>: DEMO STUFF</p> <p>
<em>TEST STUFF</em>: WRAP UP STUFF</p> <p><strong><span style="text-
decoration:underline;">REST OF STUFF</span></strong></p> <p><em>Aloha 
POS</em>: KEEP THIS STUFF TOO</p> <p><em>Revel</em>: WHAT STUFF</p> <p> 
DONE</p> 

Explained

NODE                     EXPLANATION
----------------------------------------------------------------------
  <img                     '<img'
----------------------------------------------------------------------
  \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ='                       '=\''
----------------------------------------------------------------------
    [^']*                    any character except: ''' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    '                        '\''
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ="                       '="'
----------------------------------------------------------------------
    [^"]*                    any character except: '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    =                        '='
----------------------------------------------------------------------
    [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
    [^\s>]*                  any character except: whitespace (\n,
                             \r, \t, \f, and " "), '>' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
  >                        '>'

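The alternatives in the group can be checked one at a time. This is only a rough sketch with made-up tags (none of them come from the question); it tests whether each tag is matched from its first character to its last.

```javascript
// Same pattern as above, without the g flag so test() and match() stay stateless.
var imgTag = /<img\s(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>/i;

// Hypothetical tags, one per attribute style the pattern distinguishes.
var tags = [
  "<img alt='single quoted'>",      // ='...' alternative
  '<img alt="double quoted">',      // ="..." alternative
  '<img width=750>',                // unquoted value alternative
  '<img alt="a > b" src="x.png">'   // '>' hidden inside a quoted value
];

// Each entry is true when the match covers the whole tag.
var results = tags.map(function (tag) {
  var m = tag.match(imgTag);
  return m !== null && m[0] === tag;
});
console.log(results);

// The \s right after <img means a bare tag with no attributes is not matched.
var bareMatches = imgTag.test('<img>');
console.log(bareMatches);
```

All four entries should come back true, while the bare <img> check should be false. If tags without attributes can occur in your input, the pattern would need to be adjusted.
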
Upvotes: 1

J Langowski

Reputation: 11

This is not a regex answer, but since you are already using JavaScript, you can use it for what it was designed for and manipulate the DOM directly, like this:

var html = '<p><img class="alignnone size-full wp-image-2857" src="https://files.wordpress.com/2016/05/laptop.jpg?w=750&#038;h=545" alt="https://pixabay.com/en/laptop-printer-office-folder-graph-1016257/" width="750" height="545" /></p> <p>STUFF</p> <p>MORE STUFF</p> <p>EVEN MORE STUFF</p> <p><strong><span style="text-decoration:underline;">OTHER STUFF</span></strong></p> <p><em>OTHER STUFF</em>: DEMO STUFF</p> <p><em>TEST STUFF</em>: WRAP UP STUFF</p> <p><strong><span style="text-decoration:underline;">REST OF STUFF</span></strong></p> <p><em>Aloha POS</em>: KEEP THIS STUFF TOO</p> <p><em>Revel</em>: WHAT STUFF</p> <p>DONE</p>';

var el = document.createElement('div'); // detached container, never added to the page
el.innerHTML = html; // let the browser parse the string
var p = el.getElementsByTagName('p')[0]; // the first <p>, which is where the image is
var img = p.getElementsByTagName('img')[0]; // there is only one here; use an id or class to be more specific
console.log(img);
p.removeChild(img); // removeChild has to be called on the node's direct parent
var result = el.innerHTML; // the HTML string without the <img />

You will want to use classes or IDs if you are going to have lots of images.
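
If the goal is simply to drop every image rather than one specific image, a variation along these lines might work. This is only a sketch: removeImages is a hypothetical helper name, and the usage part assumes a browser environment where document exists.

```javascript
// Hypothetical helper: removes every <img> inside the given container element
// and returns the remaining markup.
function removeImages(container) {
  // querySelectorAll returns a static list, so removing nodes while looping is safe.
  var images = container.querySelectorAll('img');
  for (var i = 0; i < images.length; i++) {
    var img = images[i];
    // removeChild is called on each image's direct parent.
    img.parentNode.removeChild(img);
  }
  return container.innerHTML;
}

// Browser-only usage; `document` is not defined in plain Node.js.
if (typeof document !== 'undefined') {
  var el = document.createElement('div');
  el.innerHTML = '<p><img src="a.jpg" /></p> <p>STUFF</p>';
  console.log(removeImages(el));
}
```

Passing a more specific selector (a class or an id) to querySelectorAll would narrow this down to particular images.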

Upvotes: 1

Laurel

Reputation: 6173

You could use this regex to remove the img tag:

<img[^>]+>

I don't know what you were trying to do with the regex you had, honestly. It doesn't need to be complicated: the only "regex construct" that I had to use was [^>]+, which just matches one or more characters that aren't >.

The benefits of using a simple regex are readability and speed. Of course, if you want to account for edge cases (such as false positives in embedded JS), you should use an HTML parser.
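
Here is a small sketch of the simple regex in use, along with one edge case where it goes wrong. The g and i flags, the variable names, and both sample strings are assumptions made up for illustration.

```javascript
// The simple pattern, with g and i flags added as an assumption.
var simple = /<img[^>]+>/gi;

// Hypothetical, shortened version of the HTML in the question.
var html = '<p><img src="laptop.jpg" width="750" /></p> <p>STUFF</p>';
var simpleResult = html.replace(simple, '');
console.log(simpleResult); // <p></p> <p>STUFF</p>

// Edge case: a '>' inside an attribute value ends the match too early,
// leaving the tail of the tag behind.
var tricky = '<p><img alt="a > b" src="x.png"></p>';
var trickyResult = tricky.replace(simple, '');
console.log(trickyResult); // <p> b" src="x.png"></p>
```

If the HTML comes from a source you control and attribute values never contain >, the simple version is probably enough.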

Upvotes: 1
