Namey

Reputation: 1212

JavaScript Library/Function to find Unclosed HTML Tags

I am looking for a way to find and list any unclosed HTML tags in an arbitrary slice of raw HTML. This doesn't seem like it should be a hard problem, but I cannot find anything that does it in JS. It needs to run client-side, since it is used for rendering annotations on HTML pages. Annotations are somewhat nasty business, since they select or apply formatting that may cover only part of an HTML element (i.e., markup overlaid onto existing HTML markup).

One simple use-case is where you might want to only render part of an HTML page, but then inject the rest later. For example, imagine a hypothetical segment:

<p>This is my text <StartDelayedInject/> with a comment I added. </p>
<p> But it doesn't exist until now. </p> <StopDelayedInject/>

I'll be doing some pre-processing to rebuild the HTML so that I wrap partial elements into span-type elements that apply the appropriate formatting. Initially this would be parsed in the form:

<p><span>This is my text</span></p>

After some user action, it would then be modified to a form such as:

<p><span>This is my text</span><span>with a comment I added.</span></p>
<p>But it doesn't exist until now.</p>

This is a very simplified example (things like ul elements and tables get hairier), but it gives the general principle. To do this effectively, I need to be able to check a segment of HTML and figure out whether any tags have been opened but not closed. With that information, I can wrap the last unterminated text into a span, close the unclosed tag, and know to return to that point to inject the remainder of the content when needed. I also need to know which tags were still open, so that when I inject or modify another segment of content, I can put it in the right place (e.g., get "with a comment I added." into the first paragraph).
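
To make the closing step concrete, here is a minimal sketch that assumes the list of still-open tags is already known. The name closeOpenTags and its parameters are hypothetical, chosen only for illustration; this is not an existing library function.

```javascript
// Hypothetical helper: given an HTML fragment and the tags still open at its
// end (outermost first), append the matching closing tags, innermost first.
function closeOpenTags(fragment, openTags) {
  const closers = [...openTags]
    .reverse()
    .map((tag) => `</${tag}>`)
    .join("");
  return fragment + closers;
}

// Example: the first paragraph of the segment above, cut off mid-element.
console.log(closeOpenTags("<p><span>This is my text</span>", ["p"]));
// prints "<p><span>This is my text</span></p>"
```

The same openTags list would then tell me where to re-open the structure when the delayed content is injected.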

From my understanding of context-free grammars, this should be a relatively trivial task: each time you open/enter or close/exit a tag, you update a stack of the tags opened but not yet closed. That said, I'd much rather use a mature library than write a naive parser for this purpose. I'd assume there's some JS HTML parser around that would do this, right? Plenty of them know how to close tags, so clearly at some point they calculate this.
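
For reference, the stack idea described above might look like the following minimal sketch. It is exactly the kind of naive scanner I would prefer to avoid, not a spec-compliant parser: findUnclosedTags and VOID_TAGS are hypothetical names, the regular expression only recognizes well-formed tags and comments, and it ignores HTML's optional-end-tag rules and the contents of script/style elements.

```javascript
// Elements that never take a closing tag (HTML "void" elements).
const VOID_TAGS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img", "input",
  "link", "meta", "source", "track", "wbr",
]);

// Hypothetical helper: returns the names of tags that were opened but not
// closed in `html`, outermost first.
function findUnclosedTags(html) {
  const stack = [];
  // Matches either a comment (skipped) or a tag: optional "/", a name, then
  // attributes, where quoted attribute values may contain ">".
  const tagPattern =
    /<!--[\s\S]*?-->|<(\/?)([a-zA-Z][a-zA-Z0-9-]*)((?:"[^"]*"|'[^']*'|[^'">])*)>/g;
  let match;
  while ((match = tagPattern.exec(html)) !== null) {
    if (match[2] === undefined) continue; // a comment, not a tag
    const tag = match[2].toLowerCase();
    const isClosing = match[1] === "/";
    const isSelfClosing = match[3].trimEnd().endsWith("/");
    if (isClosing) {
      // Pop back to the nearest matching open tag; anything opened inside it
      // is treated as implicitly closed. Stray closing tags are ignored.
      const index = stack.lastIndexOf(tag);
      if (index !== -1) stack.length = index;
    } else if (!VOID_TAGS.has(tag) && !isSelfClosing) {
      stack.push(tag);
    }
  }
  return stack;
}

console.log(findUnclosedTags("<p>This is my text <StartDelayedInject/>"));
// prints [ 'p' ]
```

Because the stack keeps the nesting order, the result says not only which tags are open but also in what order they would have to be closed and later re-opened.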

Upvotes: 5

Views: 6210

Answers (2)

hbruce

Reputation: 932

Not perfect, but here's my quick method for checking for a mismatch between open and close tags:

function find_unclosed_tags(str) {
    // Tag names are case-insensitive, so normalize before matching.
    str = str.toLowerCase();
    var tags = ["a", "span", "div", "ul", "li", "h1", "h2", "h3", "h4", "h5", "h6", "p", "table", "tr", "td", "b", "i", "u"];
    var mismatches = [];
    tags.forEach(function(tag) {
        // The name must be followed by whitespace or ">", so "<b" does not match "<br>".
        var pattern_open = '<' + tag + '[\\s>]';
        var pattern_close = '</' + tag + '\\s*>';

        var open_count = (str.match(new RegExp(pattern_open, 'g')) || []).length;
        var close_count = (str.match(new RegExp(pattern_close, 'g')) || []).length;

        // Only the counts are compared; nesting order is not checked.
        if (open_count !== close_count) {
            mismatches.push("Open/close mismatch for tag " + tag + ".");
        }
    });

    return mismatches;
}

Upvotes: 2

rescuecreative

Reputation: 3859

The problem is that JavaScript only has access to the HTML in two ways:

  1. As a set of objects: each element is an object with properties and methods, created by the browser on page load.
  2. As a string of text.

Using the first method of interfacing with HTML, there is no way to detect unclosed tags, as you only have access to the objects that the browser creates for you after it parses the HTML.

Using the second method, you would have to run the entire string of HTML through an HTML parser. Some people might assume you could do it simply with regular expressions, but this is not feasible. I refer you to this fantastic Stack Overflow question.

Even if you found a really robust HTML parser to use, you would still run into the problem that, before your JavaScript even touches it, the browser will have attempted to parse the potentially broken HTML, and there could be errors everywhere.

Edit:

If you like the parser idea, John Resig wrote an example parser that you might want to reference.

Upvotes: 4
