Regular Expressions to fix invalid HTML

Question

I have hundreds of files (ancient ASP and HTML) filled with outdated and often completely invalid HTML code.

Between Visual Studio and ReSharper, this invalid HTML is flagged and easily visible once the editor window is scrolled to where it appears. However, neither tool provides a way to quickly fix the errors across the whole project.

The first few errors on which ReSharper focuses my attention are tags that are either not closed or closed but not opened. Sometimes this occurs because the opening and closing tags overlap - for instance:

some text

start of a paragraph
    with multiple lines of text/hmtl
     with a nice mix of junk

Sometimes opening tags without a corresponding closing tag were allowed in older versions of HTML (or the tools which generated the HTML didn't care about the standards as some browsers usually figured out what the author meant). So the mess I'm attempting to clean up has many unclosed HTML tags that ought to be closed.

some more text^*
...
...

And just for good measure, the code includes lots of closing HTML tags that have no matching start tag.

some text that is actually within closed tags

So, other than writing a new application to parse, flag, and fix all these errors - does anyone have some .Net regular expressions that could be used to locate and preferably fix this stuff with Visual Studio 2012's Search and Replace feature?

Though a single expression that does it all would be nice, multiple expressions that each handle one of the above cases would still be very helpful.

For the case of overlapped HTML tags, I'm using this expression:

(?n)(?<t1s>(?><(?<t1n>\w+)[^>]*>))(?<c1>((?!</\k<t1n>>)(\n|.))*?)(?<t2s>(?><(?!\k<t1n>)(?<t2n>(?>\w+))[^>]*>))(?<c2>((?!(</(\k<t1n>|\k<t2n>)>))(\n|.))*?)(?<t1e></\k<t1n>>)(?<c3>(?>(\n|.)*?))(?<t2e></\k<t2n>>)

Explanation:
    (?n) Ignore unnamed captures.
    (?<t1s>(?><(?<t1n>\w+)[^>]*>)) Get the first tag, capturing the full tag and attributes
      for replacement and the name alone for further matching.
    (?<c1>((?!</\k<t1n>>)(\n|.))*?) Capture content between the first and second tag.
    (?<t2s>(?><(?!\k<t1n>)(?<t2n>(?>\w+))[^>]*>)) Get the second tag, capturing the full
      tag and attributes for replacement and the name alone for further matching, and ensuring
      it does not match the first tag and that the first tag is still open.
    (?<c2>((?!(</(\k<t1n>|\k<t2n>)>))(\n|.))*?) Capture content between the second tag and the
      closing of the first tag.
    (?<t1e></\k<t1n>>) Capture the closing of the first tag, where the second tag is
      still open.
    (?<c3>(?>(\n|.)*?)) Capture content between the closing of the first tag and the closing
      of the second tag.
    (?<t2e></\k<t2n>>) Capture the closing of the second tag.

With this replacement expression:

${t1s}${c1}${t2s}${c2}${t2e}${c3}${t1e}

The issue with this search expression is that it is painfully slow. Using . instead of (\n|.) for the three content captures is much quicker, but limits the results to matches where the overlapped tags and the intervening content are all on a single line.
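One alternative to the (\n|.) alternation is the single-line option, which lets . match newlines. .NET spells it (?s) inline or RegexOptions.Singleline. The sketch below uses Python's re module only to illustrate the idea, since (?s) has the same meaning there. The sample text is made up, and whether this is actually faster in the Visual Studio search box is something to measure rather than assume.

```python
import re

# Made-up sample: one tag pair whose content spans two lines.
text = "<b>first line\nsecond line</b>"

# By default `.` does not match a newline, so the pair is missed entirely.
single_line = re.search(r"<b>(.*?)</b>", text)

# With the inline (?s) option, `.` also matches newlines,
# so no (\n|.) alternation is needed.
multi_line = re.search(r"(?s)<b>(.*?)</b>", text)

print(single_line)                # None
print(repr(multi_line.group(1)))  # 'first line\nsecond line'
```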

The expression will also match valid, properly closed and properly nested HTML if the first tag appears inside the content of the second tag, like this:

hello world*

So it is not safe to use the expression in a "Replace All" operation, especially across the hundreds of files in the solution.
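The false positive is easy to reproduce with a rough Python port of the overlapped-tag expression. This is a sketch under assumptions, not the original .NET pattern: Python writes named groups as (?P<name>...), the atomic groups are dropped, re.DOTALL stands in for (\n|.), and the group names t1n and t2n, the function name, and both sample strings are invented for the illustration.

```python
import re

# Rough port of the overlapped-tag search expression (assumptions noted above).
OVERLAP = re.compile(
    r"(?P<t1s><(?P<t1n>\w+)[^>]*>)"                 # first opening tag
    r"(?P<c1>(?:(?!</(?P=t1n)>).)*?)"               # content up to the second tag
    r"(?P<t2s><(?!(?P=t1n))(?P<t2n>\w+)[^>]*>)"     # second opening tag, different name
    r"(?P<c2>(?:(?!</(?:(?P=t1n)|(?P=t2n))>).)*?)"  # content up to either closing tag
    r"(?P<t1e></(?P=t1n)>)"                         # closing of the first tag
    r"(?P<c3>.*?)"                                  # content up to the second closing
    r"(?P<t2e></(?P=t2n)>)",                        # closing of the second tag
    re.DOTALL,
)

# Same order as the replacement expression: the two closing tags are swapped.
REPLACEMENT = r"\g<t1s>\g<c1>\g<t2s>\g<c2>\g<t2e>\g<c3>\g<t1e>"


def swap_closing_tags(html):
    """Apply the search-and-replace once over the whole string."""
    return OVERLAP.sub(REPLACEMENT, html)


overlapped = "<b>some <i>text</b> more</i>"                  # genuinely overlapped
nested = "<div><span>hello <div>world</div></span></div>"    # valid, properly nested

print(swap_closing_tags(overlapped))  # <b>some <i>text</i> more</b>
print(swap_closing_tags(nested))      # <div><span>hello <div>world</span></div></div>
```

The second result shows the damage: the inner closing div is taken for the closing of the outer one, so markup that was valid before the replacement is overlapped after it.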

For unclosed tags, I've successfully handled the self-closing tags: , , , , , and . However, I have not yet attempted the generic case for all the other tags: those that may have content or should be closed with a separate closing tag.

Also, I have no idea how to match closing tags that lack a matching opening tag. A simple expression that matches any closing tag will match all of them, regardless of whether they have a matching opening tag.


Answers (1)

Solution-Wide Analysis

Your current "solution" is to use regexes on a context-sensitive language (invalid HTML). Please, NO. People already flip out when someone suggests parsing context-free languages with regexes.
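For contrast, here is a minimal sketch of what a tokenizer plus a stack can report. It uses Python's standard html.parser module, not anything from Visual Studio or ReSharper. The class name, the check helper, and the list of void tags are assumptions made for the illustration. It only locates problems; it does not fix them.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag (assumed list, based on the HTML void elements).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Tracks open tags on a stack and records unbalanced ones."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of (tag, line, column)
        self.problems = []   # (kind, tag, line, column)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append((tag, *self.getpos()))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag not in [entry[0] for entry in self.open_tags]:
            # Closing tag with no matching opening tag.
            self.problems.append(("unopened", tag, *self.getpos()))
            return
        # Everything opened after the matching start tag is unclosed or overlapped.
        while self.open_tags[-1][0] != tag:
            self.problems.append(("unclosed", *self.open_tags.pop()))
        self.open_tags.pop()


def check(html):
    """Return a list of (kind, tag, line, column) tuples for one document."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    # Whatever is still on the stack at the end was never closed.
    for entry in reversed(checker.open_tags):
        checker.problems.append(("unclosed", *entry))
    return checker.problems


for kind, tag, line, column in check("<b>some <i>text</b> more</i>"):
    print(kind, tag, line, column)
```

An overlapped pair shows up as one "unclosed" entry followed by one "unopened" entry for the same tag name, while properly nested markup produces no entries, even when a tag is nested inside another tag of the same name.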
