Reputation: 113

Remove empty tags using RegEx

I want to delete empty tags such as <label></label>, <font> </font> so that:

<label></label><form></form>
<p>This is <span style="color: red;">red</span> 
<i>italic</i>
</p>

will be cleaned as:

<p>This is <span style="color: red;">red</span> 
<i>italic</i>
</p>

I have this RegEx in JavaScript. It deletes the empty tags, but it also deletes this: "<i>italic</i></p>"

str=str.replace(/<[\S]+><\/[\S]+>/gim, "");

What am I missing?

Upvotes: 11

Answers (12)

Hoàng Vũ Tgtt

Reputation: 2032

If you just want to remove all empty tags:

  html = html.replace(/<([a-z][a-z0-9]*)\b([^>/]*)>\s*<\/\1>/gi, '');

But be careful: removing empty cells can make tables display incorrectly. If you want to remove empty HTML tags except <tr> and <td>, use a callback in JavaScript:

html = html.replace(/<([a-z][a-z0-9]*)\b([^>/]*)>\s*<\/\1>/gi, function(match, p1, p2) {
  // p1 is the tag name, p2 the attribute text; the i flag means p1 may be upper-case
  var name = p1.toLowerCase();
  if (name === 'tr' || name === 'td') {
    return match;
  } else {
    return '';
  }
});
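
A single pass can leave behind wrappers that only become empty once their children are gone (for example <div><span></span></div>). One way to handle that is to repeat the replacement until the string stops changing. This is only a sketch: the helper name removeEmptyTags and its keep parameter are hypothetical and not part of the answer above.

```javascript
// Hypothetical helper: strips empty tag pairs repeatedly until nothing changes.
// `keep` lists lower-case tag names that must survive even when empty.
function removeEmptyTags(html, keep = ['tr', 'td']) {
  const emptyPair = /<([a-z][a-z0-9]*)\b([^>/]*)>\s*<\/\1>/gi;
  let previous;
  do {
    previous = html;
    html = html.replace(emptyPair, (match, name) =>
      keep.includes(name.toLowerCase()) ? match : ''
    );
  } while (html !== previous);
  return html;
}

console.log(removeEmptyTags('<div><span> </span></div><p>text</p>'));
console.log(removeEmptyTags('<table><tr><td></td></tr></table>'));
```

The loop ends as soon as a pass removes nothing, so kept tags such as <td> do not cause it to spin.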

Upvotes: 0

srghma

Reputation: 5333

<([^>]+)\s*>\s*<\/\1\s*>

<div>asdf</div>
<div></div> -- will match only this
<div></notdiv>
-- and this
<div  >  
    </div   >

Try it yourself at https://regexr.com/
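
The same check can be run locally. This is a small sketch that tests the pattern against the samples above, plus one extra sample with an attribute that I added for illustration:

```javascript
// The pattern from this answer: the closing tag must repeat the captured opening text.
const sameNamePair = /<([^>]+)\s*>\s*<\/\1\s*>/;

const pairSamples = [
  '<div>asdf</div>',
  '<div></div>',
  '<div></notdiv>',
  '<div  >  \n    </div   >',
  '<div class="x"></div>', // extra sample, not from the answer
];

const pairResults = pairSamples.map((s) => sameNamePair.test(s));
console.log(pairResults);
```

Because the captured group includes everything up to the first >, an opening tag with attributes is not matched: the closing tag would have to repeat the attributes too.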

Upvotes: 2

07mm8

Reputation: 3298

Remove empty tags with cheerio (this also removes images):

  $('*')
    .filter(function(index, el) {
      return (
        $(el)
          .text()
          .trim().length === 0
      )
    })
    .remove()

Remove empty tags with cheerio, but keep images:

  $('*')
    .filter(function(index, el) {
      return (
        el.tagName !== 'img' &&
        $(el).find(`img`).length === 0 &&
        $(el)
          .text()
          .trim().length === 0
      )
    })
    .remove()

Upvotes: 0

porges

Reputation: 30580

You have "not spaces" as your character class, which means "<i>italic</i></p>" will match. The first half of your regex will match "<(i>italic</i)>" and the second half "</(p)>". (I've used brackets to show what each [\S]+ matches.)

Change this:

/<[\S]+><\/[\S]+>/

To this:

/<[^/>][^>]*><\/[^>]+>/

Overall you should really be using a proper HTML processor, but if you're munging HTML soup this should suffice :)
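
A quick way to see the difference is to run both patterns over the text from the question. The variable names below are illustrative only:

```javascript
const originalPattern = /<[\S]+><\/[\S]+>/gim;
const fixedPattern = /<[^/>][^>]*><\/[^>]+>/gim;

// On one line, [\S]+ is free to run across '>' and '<', so the whole string matches.
const sampleLine = '<i>italic</i></p>';
console.log(JSON.stringify(sampleLine.replace(originalPattern, '')));
console.log(JSON.stringify(sampleLine.replace(fixedPattern, '')));

// The markup from the question: only the empty label and form should disappear.
const samplePage =
  '<label></label><form></form>\n' +
  '<p>This is <span style="color: red;">red</span>\n' +
  '<i>italic</i>\n' +
  '</p>';
console.log(samplePage.replace(fixedPattern, ''));
```

Note that the fixed pattern does not check that the closing tag name matches the opening one, and it does not allow whitespace between the two tags.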

Upvotes: 25

toastrackengima

Reputation: 8732

Here's a modern native JavaScript solution, which is actually quite similar to the jQuery one from 2010. I adapted it from that answer for a project that I am working on, and thought I would share it here.

document.querySelectorAll("*:empty").forEach((x)=>{x.remove()});

- document.querySelectorAll returns a NodeList, which is essentially an array-like list of all DOM nodes that match the CSS selector given to it as an argument.
- *:empty is a selector which selects all elements (* means "any element") that are empty (which is what :empty means).
  
  This will select any empty element within the entire document. If you only want to remove empty elements from within a certain part of the page (e.g. only those within some div element), you can add an id to that element and then use the selector #id *:empty, which means any empty element within the element with an id of id.
  
  This is almost certainly what you want. Technically some important tags (e.g. <meta> tags, <br> tags, <img> tags, etc.) are "empty", so without specifying a scope you will end up deleting some tags you probably care about.
- forEach loops through every element in the resulting NodeList and runs the anonymous function (x)=>{x.remove()} on it. x is the current element in the list, and calling .remove() on it removes that element from the DOM.

Hopefully this helps someone. It's amazing to see how far JavaScript has come in just 8 years; from almost always needing a library to write something complex like this in a concise manner to being able to do so natively.

Edit

So, the method detailed above will work fine in most circumstances, but it has two issues:

- Elements like <div> </div> are not treated as :empty (note the space in between). CSS Selectors Level 4 addresses this with the :blank selector (which is like :empty except that it ignores whitespace), but at the time of writing only Firefox supports it (in vendor-prefixed form).
- Self-closing tags are caught by :empty, and this will remain the case with :blank, too.

I have written a slightly larger function which deals with these two use cases:

document.querySelectorAll("*").forEach((x)=>{
    let tagName = "</" + x.tagName + ">";
    // slice(-n) takes the last n characters, which is where a closing tag would be
    if (x.outerHTML.slice(-tagName.length).toUpperCase() == tagName.toUpperCase()
        // remove only when the contents hold no non-whitespace character
        && !/[^\s]/.test(x.innerHTML)) {
        x.remove();
    }
});

We iterate through every element on the page. We grab that element's tag name (for example, if the element is a div this would be DIV) and use it to construct a closing tag, e.g. </DIV>.

That tag is 6 characters long. We check whether the upper-cased last 6 characters of the element's HTML match it. If they do, we continue. If they don't, the element doesn't have a closing tag, and therefore must be self-closing. This is preferable to a list, because it means you don't have to update anything should a new self-closing tag get added to the spec.

Then, we check whether the contents of the element contain anything other than whitespace. /[^\s]/ is a RegEx. [] is a set in RegEx, and will match any character that appears inside it. If ^ is the first character, the set becomes negated: it will match any character that is NOT in the set. \s means whitespace: tabs, spaces, line breaks. So what [^\s] says is "any character that is not whitespace".

Putting that together, if the tag is not self-closing and its contents contain no non-whitespace character, then we remove it.

Of course, this is a bit bigger and less elegant than the previous one-liner. But it should work for essentially every case.
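
To try the closing-tag check and the whitespace-only check without a browser, they can be pulled into a pure function that takes the strings the DOM would supply. The function name and its parameters are hypothetical, chosen only for this sketch:

```javascript
// Hypothetical helper: tagName, outerHTML and innerHTML are the values
// the DOM would report for a single element.
function isBlankPairedElement(tagName, outerHTML, innerHTML) {
  const closingTag = '</' + tagName + '>';
  // Does the serialized element end with its own closing tag?
  const hasClosingTag =
    outerHTML.slice(-closingTag.length).toUpperCase() === closingTag.toUpperCase();
  // Is there nothing but whitespace between the tags?
  const hasOnlyWhitespace = !/[^\s]/.test(innerHTML);
  return hasClosingTag && hasOnlyWhitespace;
}

console.log(isBlankPairedElement('DIV', '<div> </div>', ' '));
console.log(isBlankPairedElement('BR', '<br>', ''));
console.log(isBlankPairedElement('P', '<p>text</p>', 'text'));
```

Only the first call should report an element that is safe to remove: the <br> has no closing tag, and the <p> has real content.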

Upvotes: 2

P070

Reputation: 189

Found this on CodePen. It uses jQuery, but it does the job:

$('element').each(function() {
  if ($(this).text() === '') {
    $(this).remove();
  }
});

You will need to change 'element' to a selector for the part of the page you want to clean. Do not point it at the whole document, because that leads to the problem I raised on toastrackengima's answer.

Upvotes: 0

user3752734

Reputation: 1

You can use this one: text = text.replace(/<[^/>][^>]*>\s*<\/[^>]+>/gim, "");

Upvotes: 0

Civa

Reputation: 2176

All the regex answers only handle

<label></label>

but in the case of

<label> </label>
<label>    </label>
<label>
</label>

try this pattern to get all the above

<[^/>]+>[ \n\r\t]*</[^>]+>
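
As a JavaScript literal the slash in the closing tag has to be escaped. Here is a minimal sketch that applies the pattern to the cases above, plus one non-empty label added for contrast:

```javascript
// Same pattern as above, written as a JavaScript regex literal with the g flag.
const whitespaceOnlyPair = /<[^/>]+>[ \n\r\t]*<\/[^>]+>/g;

const labelSamples = [
  '<label></label>',
  '<label> </label>',
  '<label>    </label>',
  '<label>\n</label>',
  '<label>Name</label>', // extra sample: has content, so it should stay
].join('\n');

const cleanedLabels = labelSamples.replace(whitespaceOnlyPair, '');
console.log(JSON.stringify(cleanedLabels));
```

The newlines that separated the samples are left behind, and the pattern does not verify that the closing tag name matches the opening one.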

Upvotes: 9

Rodrick Chapman

Reputation: 5543

I like MattMitchell's jQuery solution but here is another option using native JavaScript.

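function CleanChildren(elem)
{
    var children = elem.childNodes;

    // Walk backwards: childNodes is live, so removing a node shifts later indexes.
    for (var i = children.length - 1; i >= 0; i--)
    {
        var child = children[i];

        // Only consider elements (nodeType 1); text nodes never have children
        // and would otherwise be deleted along with the empty tags.
        if (child.nodeType !== 1)
            continue;

        // Clean descendants first, so a parent emptied by the recursion goes too.
        CleanChildren(child);

        // Note: this also removes childless elements such as <br> and <img>.
        if (!child.hasChildNodes())
            elem.removeChild(child);
    }
}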

Upvotes: 2

Matt Mitchell

Reputation: 41823

Regex is not for HTML. If you're in JavaScript anyway, I'd encourage you to use jQuery DOM processing.

Something like:

$('*:empty').remove();

Alternatively:

$("*").filter(function() 
{ 
     // keep only elements whose HTML is empty or whitespace, then remove them
     return $.trim($(this).html()).length === 0; 
}).remove();

Upvotes: 23

Jamie Wong

Reputation: 18350

This is an issue of greedy regex. Try this:

str=str.replace(/<[^>]+><\/[\S]+>/gim, "");

str=str.replace(/<[\S]+?><\/[\S]+>/gim, "");

In your regex, <[\S]+> matches <i>italic</i> and the <\/[\S]+> matches the </p>

Upvotes: 1

Alex Martelli

Reputation: 881735

You need /<[\S]+?><\/[\S]+?>/ -- the difference is the ?s after the +s, to match "as few as possible" (AKA "non-greedy match") nonspace characters (though 1 or more), instead of the bare +s which match "as many as possible" (AKA "greedy match").

Avoiding regular expressions altogether, as the other answer recommends, is also an excellent idea, but I wanted to point out the important greedy vs non-greedy distinction, which will serve you well in a huge variety of situations where regexes are warranted.
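
A small sketch of the distinction, using illustrative sample strings. The last two checks are a caveat worth verifying for yourself: since \S also matches < and >, a lazy quantifier still expands across tags when no shorter match exists, so on a single-line input like the one in the question a negated character class is the more reliable restriction.

```javascript
const greedyDemo = '<b></b> and <i></i>';

// Greedy: .+ grabs as much as it can, so the match runs to the last '>'.
const greedyMatch = greedyDemo.match(/<.+>/)[0];
// Lazy: .+? grabs as little as it can, so the match stops at the first '>'.
const lazyMatch = greedyDemo.match(/<.+?>/)[0];
console.log(greedyMatch);
console.log(lazyMatch);

// Caveat: lazy is not the same as "never crosses a tag boundary".
const questionLine = '<i>italic</i></p>';
const lazyPair = /<[\S]+?><\/[\S]+?>/;
const negatedPair = /<[^/>][^>]*><\/[^>]+>/;
const lazyPairMatch = questionLine.match(lazyPair);
console.log(lazyPairMatch ? lazyPairMatch[0] : null);
console.log(negatedPair.test(questionLine));
```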

Upvotes: 3
