Zarepheth

Reputation: 2593

Regular Expressions to fix invalid HTML

I have hundreds of files (ancient ASP and HTML) filled with outdated and often completely invalid HTML code.

Between Visual Studio and ReSharper, the invalid HTML is flagged and easy to see once the editor window is scrolled to where it appears. However, neither tool provides a way to quickly fix the errors across the whole project.

The first few errors on which ReSharper focuses my attention are tags that are either not closed or closed but not opened. Sometimes this occurs because the opening and closing tags overlap - for instance:

<font face=verdana size=5><b>some text</font></b>

<span><p>start of a paragraph
    with multiple lines of <i><b>text/hmtl
    </i> with a nice mix of junk</b>
</span></p>

Older versions of HTML sometimes allowed opening tags without a corresponding closing tag (or the tools that generated the HTML ignored the standards, since browsers usually figured out what the author meant). So the mess I'm attempting to clean up has many unclosed HTML tags that ought to be closed.

<font face = tahoma size=2>some more text<b><sup>*</sup></b>
...
...
</body>
</html>

And just for good measure, the code includes lots of closing HTML tags that have no matching start tag.

</b><p>some text that is actually within closed tags</p>
</td>
</tr>
</table>

So, other than writing a new application to parse, flag, and fix all these errors, does anyone have some .NET regular expressions that could be used to locate and preferably fix this stuff with Visual Studio 2012's Search and Replace feature?

Though a single expression that does it all would be nice, multiple expressions that each handle one of the above cases would still be very helpful.

For the case of overlapped HTML tags, I'm using this expression:

(?n)(?<t1s>(?><(?<t1>\w+)[^>]*>))(?<c1>((?!</\k<t1>>)(\n|.))*?)(?<t2s>(?><(?!\k<t1>)(?<t2>(?>\w+))[^>]*>))(?<c2>((?!(</(\k<t1>|\k<t2>)>))(\n|.))*?)(?<t1e></\k<t1>>)(?<c3>(?>(\n|.)*?))(?<t2e></\k<t2>>)

Explanation:
    (?n) Ignore unnamed captures.
    (?<t1s>(?><(?<t1>\w+)[^>]*>)) Get the first tag, capturing the full tag and attributes
      for replacement and the name alone for further matching.
    (?<c1>((?!</\k<t1>>)(\n|.))*?) Capture content between the first and second tag.
    (?<t2s>(?><(?!\k<t1>)(?<t2>(?>\w+))[^>]*>)) Get the 2nd tag, capturing the full
      tag and attributes for replacement, the name alone for further matching, and ensuring
      it does not match the 1st tag and that the first tag is still open.
    (?<c2>((?!(</(\k<t1>|\k<t2>)>))(\n|.))*?) Capture content between the second tag
      and the closing of the first tag.
    (?<t1e></\k<t1>>) Capture the closing of the first tag, where the second tag is
      still open.
    (?<c3>(?>(\n|.)*?)) Capture content between the closing of the first tag and the closing
      of the second tag.
    (?<t2e></\k<t2>>) Capture the closing of the second tag.

With this replacement expression:

${t1s}${c1}${t2s}${c2}${t2e}${c3}${t1e}

The issue with this search expression is that it is painfully slow. Using . instead of (\n|.) for the three content captures is much quicker, but limits the results to just those where the overlapped tags and intervening content are on a single line.
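A possible middle ground is single-line (dot-all) mode, where . itself matches a line break and the (\n|.) alternation is no longer needed. The sketch below shows the idea in Python rather than .NET, using a deliberately tiny pattern instead of the full expression above. .NET documents the same inline option as (?s) (RegexOptions.Singleline), but whether it is actually faster inside Visual Studio's search is an assumption that has not been measured here.

```python
import re

sample = "<b>first line\nsecond line</b>"

# Alternation form, as used in the three content captures above.
alternation = re.compile(r"<b>(?:\n|.)*?</b>")

# Dot-all form: "(?s)" lets "." match "\n", so no alternation is needed.
dotall = re.compile(r"(?s)<b>.*?</b>")

# Plain form: "." stops at the line break, so a multi-line span is missed.
plain = re.compile(r"<b>.*?</b>")

same_match = alternation.search(sample).group(0) == dotall.search(sample).group(0)
no_match = plain.search(sample) is None
```

Both multi-line forms find the same span; only the plain form misses it.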

The expression will also match valid, properly closed and properly nested HTML if the first tag appears inside the content of the second tag, like this:

<font color=green><b>hello world</b></font><span class="whatever"><font color=red>*</font></span>

So it is not safe to use the expression in a "Replace All" operation, especially across the hundreds of files in the solution.

For unclosed tags, I've successfully handled the self-closing tags: <img/>, <meta/>, <input/>, <link/>, <br/>, and <hr/>. However, I've still not attempted the generic case for all the other tags - those that may have content or should be closed with a separate closing tag.

Also, I've no idea how to match closing tags without a matching opening tag. The simple solution of </\w+> will match all closing tags regardless of whether or not they have a matched opening tag.

Upvotes: 2

Views: 1261

Answers (1)

Laurel

Reputation: 6173

According to their website, ReSharper has this feature:

Solution-Wide Analysis

Not only is ReSharper capable of analyzing a specific code file for errors, but it can extend its analysis skills to cover your whole solution.

...

All you have to do is explicitly switch Solution-Wide Analysis on, and then, after it analyzes the code of your solution, view the list of errors in a dedicated window:

[Many errors here]

Even without opening that window, you can still easily navigate through errors in your solution with Go to Next Error in Solution (Shift+Alt+PageDown) and Go to Previous Error in Solution (Shift+Alt+F12) commands.

Your current "solution" is to use regexes on a context-sensitive language (invalid HTML). Please, NO. People already flip out when someone suggests parsing context-free languages with regexes.
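To make the parser argument concrete, here is a minimal sketch in Python (not .NET, so treat it as an illustration rather than a drop-in tool). It uses the standard library's html.parser with a stack of open tags to report closing tags with no opener, tags left open when an outer tag closes, and tags that are never closed. TagBalanceChecker, check, and the VOID_TAGS list are names made up for this example, and the void-tag list is only the handful mentioned in the question.

```python
from html.parser import HTMLParser

# Tags that never take a separate closing tag (assumed list; extend as needed).
VOID_TAGS = {"img", "meta", "input", "link", "br", "hr"}


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: records tags that do not pair up cleanly."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of (tag, line) for tags still open
        self.problems = []   # (line, description) tuples

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        line = self.getpos()[0]
        if tag not in [name for name, _ in self.open_tags]:
            self.problems.append((line, f"</{tag}> has no opening tag"))
            return
        # Pop back to the matching opener; anything above it was left open.
        while self.open_tags:
            name, opened_at = self.open_tags.pop()
            if name == tag:
                break
            self.problems.append(
                (opened_at, f"<{name}> still open when </{tag}> was reached"))

    def close(self):
        super().close()
        # Whatever remains on the stack was never closed at all.
        for name, opened_at in self.open_tags:
            self.problems.append((opened_at, f"<{name}> is never closed"))
        self.open_tags = []


def check(html_text):
    checker = TagBalanceChecker()
    checker.feed(html_text)
    checker.close()
    return checker.problems
```

Running check on the question's first sample, <font face=verdana size=5><b>some text</font></b>, reports the <b> that is still open when </font> arrives and then the stray </b>. This only locates problems; deciding how to repair each one is a separate step.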

On second thought, there might be a solution that we can use regexes for.

For this HTML:

<i><b>text/html
</i> with a nice mix of junk</b>

A better transformation would be (it's more valid, right?):

<i></i><b><i>text/html
</i> with a nice mix of junk</b>

There are many ways this could go wrong (although the markup is pretty bad as-is), but I assume you have it all backed up. This regex (where i is an example of a tag you may want to do this with):

<(i(?: [^>]+)?)>([^<]*)<(\/?[^i](?: [^>]+)?)>

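Might help you out. I don't know how regex replace works in whatever flavor you're using, but if you replace $0 (everything matched by the regex) with <$1>$2</$1><$3><$1>, you'll get the transformation I'm talking about. Note that $1 also captures any attributes on the tag, so the </$1> in the replacement only comes out right for a bare <i>.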
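As a quick check of that replacement, here is a small sketch in Python, where the group references are written \1, \2, \3 instead of $1, $2, $3. The pattern is the one above, unchanged; close_before_next_tag is a made-up name for this example.

```python
import re

# The pattern from above, unchanged; "i" is the tag being repaired.
PATTERN = re.compile(r"<(i(?: [^>]+)?)>([^<]*)<(\/?[^i](?: [^>]+)?)>")

# Python spells $1, $2, $3 as \1, \2, \3 in the replacement string.
REPLACEMENT = r"<\1>\2</\1><\3><\1>"


def close_before_next_tag(html_text):
    """Close the <i> before the next tag, then reopen it after that tag."""
    return PATTERN.sub(REPLACEMENT, html_text)


broken = "<i><b>text/html\n</i> with a nice mix of junk</b>"
fixed = close_before_next_tag(broken)
```

One thing the sketch makes visible: [^i] matches exactly one character, so as written the pattern only fires when the tag that follows has a one-letter name (b, u, p, and so on). Something like <i><font> is left untouched.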

Upvotes: 1
