subash

Reputation: 4137

Checking an HTML string for unopened tags

I have HTML source in a string, and I want to check whether it contains a closing tag that was never opened.

For example, the string below contains a </u> after WAVEFORM that has no opening <u>.

WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,

I want to detect unopened tags like this and then prepend the missing opening tag to the start of the string. How can I do that?

Upvotes: 5

Views: 5020

Answers (3)

Joseph Stateson

Reputation: 27

    using System;
    using HtmlDocument = HtmlAgilityPack.HtmlDocument;

    // Useful for finding errors in BBCode or HTML. A typical error is
    // "End tag not found line:1, char:592", and Notepad can be used to locate it.
    // A space is appended to strIn because a trailing < was otherwise not found.
    public static string HttpParse(string strIn)
    {
        string strRtn = "";
        HtmlDocument htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(strIn + " ");
        foreach (var strErr in htmlDoc.ParseErrors)
        {
            // Collect each parse error's description, one per line.
            strRtn += strErr.Reason + Environment.NewLine;
        }
        return strRtn;
    }

Upvotes: 0

João Angelo

Reputation: 57668

For this specific case, you can use the HTML Agility Pack to check whether the HTML is well formed or contains tags that were never opened.

using System;
using HtmlAgilityPack;

var htmlDoc = new HtmlDocument();

htmlDoc.LoadHtml(
    "WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,");

foreach (var error in htmlDoc.ParseErrors)
{
    // Prints: TagNotOpened
    Console.WriteLine(error.Code);
    // Prints: Start tag <u> was not found
    Console.WriteLine(error.Reason); 
}

Upvotes: 7

bobince

Reputation: 536369

Not so easy. You can't directly use an HTML parser as it's not valid HTML, but you can't easily throw a regex at the whole thing as regexes can't cope with nesting or other HTML complications.

Probably the best you could do is use a regex to find each markup structure, e.g. something like:

<(\w+)(?:\s+[-\w]+(?:\s*(?:=\s*(?:"[^"]*"|'[^']*'|[^'">\s][^>\s]*)))?)*\s*>
|</(\w+)\s*>
|<!--.*?-->

Start with an empty tags-to-open list and an empty tags-to-close list. For each match in the string, look at groups 1 and 2 to see if you've got a start or end tag. (Or a comment, which you can ignore.)

If you've got a start tag, you need to know if it needs closing, i.e. whether it's one of the EMPTY content-model tags like <img>. If an element is EMPTY, it doesn't need closing, so you can ignore it. (If you have XHTML, this is all a bit easier.)

If you have a start tag that needs closing, add the tag name in the regex group to the tags-to-close list. If you've got an end tag, take one tag off the end of the tags-to-close list (it should be the same tag name as was on there, otherwise you've got invalid markup). If there are no tags on the tags-to-close list, instead add the tag name to the tags-to-open list.

Once you've got to the end of the input string, prepend an open tag for each entry in the tags-to-open list to the start of the string, in reverse order, and append a close tag for each entry in the tags-to-close list to the end, again in reverse order.
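The steps above could be sketched in C# roughly as follows. This is a minimal sketch under a few assumptions, not a definitive implementation: the class name `TagBalancer`, the method name `Balance`, and the `VoidTags` list are made up for illustration; the list of EMPTY elements is a partial guess you would adjust for your HTML flavour; `RegexOptions.Singleline` is assumed so that comments can span lines; and mismatched end-tag names are not checked.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagBalancer
{
    // Assumption: a partial list of EMPTY (void) elements that never take an end tag.
    private static readonly HashSet<string> VoidTags = new HashSet<string>(
        new[] { "area", "base", "br", "col", "embed", "hr", "img", "input",
                "link", "meta", "param", "source", "track", "wbr" },
        StringComparer.OrdinalIgnoreCase);

    // The regex from the answer: group 1 = start tag name, group 2 = end tag name,
    // third alternative = comment (no group).
    private static readonly Regex Markup = new Regex(
        @"<(\w+)(?:\s+[-\w]+(?:\s*(?:=\s*(?:""[^""]*""|'[^']*'|[^'"">\s][^>\s]*)))?)*\s*>" +
        @"|</(\w+)\s*>" +
        @"|<!--.*?-->",
        RegexOptions.Singleline);

    public static string Balance(string html)
    {
        var toOpen = new List<string>();   // end tags seen with nothing open
        var toClose = new List<string>();  // start tags still waiting for an end tag

        foreach (Match m in Markup.Matches(html))
        {
            if (m.Groups[1].Success)
            {
                // Start tag: remember it unless it is an EMPTY element.
                string name = m.Groups[1].Value;
                if (!VoidTags.Contains(name)) toClose.Add(name);
            }
            else if (m.Groups[2].Success)
            {
                // End tag: close the most recent open tag, or record it as unopened.
                // This sketch does not verify that the names match.
                if (toClose.Count > 0) toClose.RemoveAt(toClose.Count - 1);
                else toOpen.Add(m.Groups[2].Value);
            }
            // Otherwise it is a comment, which is ignored.
        }

        // Both lists are emitted in reverse order, as described above.
        string prefix = string.Concat(Enumerable.Reverse(toOpen).Select(t => "<" + t + ">"));
        string suffix = string.Concat(Enumerable.Reverse(toClose).Select(t => "</" + t + ">"));
        return prefix + html + suffix;
    }
}
```

Called on the string from the question, `TagBalancer.Balance("WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,")` returns the input with a single `<u>` prepended.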

(Yeah, I'm parsing HTML with regex. I think the nastiness of this demonstrates why you don't want to. If there's anything you can do to avoid having already snipped your markup in the middle of a tag, do that.)

Upvotes: 0
