Reputation: 4137
I have a string as a HTML source and I want to check whether the HTML source which is string contains a tag which is not opened.
For example the string below contains </u>
after WAVEFORM which has no opening <u>
.
WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,
I just want to check for these types of unopened tag and then I have to append the open tag to the start of the string?
Upvotes: 5
Views: 5020
Reputation: 27
using HtmlDocument = HtmlAgilityPack.HtmlDocument;
/* This function is useful for finding errors in BBCode or HTML. Typical error is "End tag not found line:1, char:592" and notepad can be used to easily locate any errors. I had to append a space after the strIn as a trailing < was not found.*/
public static string HttpParse(string strIn)
{
string strRtn = "";
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(strIn + " ");
foreach(var strErr in htmlDoc.ParseErrors)
{
strRtn += strErr.Reason + Environment.NewLine;
}
return strRtn;
}
Upvotes: 0
Reputation: 57668
For this specific case you can use HTML Agility Pack to assert if the HTML is well formed or if you have tags not opened.
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(
"WAVEFORM</u> YES, <u>NEGATIVE AUSCULTATION OF EPIGASTRUM</u> YES,");
foreach (var error in htmlDoc.ParseErrors)
{
// Prints: TagNotOpened
Console.WriteLine(error.Code);
// Prints: Start tag <u> was not found
Console.WriteLine(error.Reason);
}
Upvotes: 7
Reputation: 536369
Not so easy. You can't directly use an HTML parser as it's not valid HTML, but you can't easily throw a regex at the whole thing as regexes can't cope with nesting or other HTML complications.
Probably about the best you could do would be to use a regex to find each markup structure, eg. something like:
<(\w+)(?:\s+[-\w]+(?:\s*(?:=\s*(?:"[^"]*"|'[^']*'|[^'">\s][^>\s]*)))?)*\s*>
|</(\w+)\s*>
|<!--.*?-->
Start with an empty tags-to-open list and an empty tags-to-close list. For each match in the string, look at groups 1 and 2 to see if you've got a start or end tag. (Or a comment, which you can ignore.)
If you've got a start tag, you need to know if it needs closing, ie. if it's one of the EMPTY
content-model tags like <img>
. If a element is EMPTY
, it doesn't need closing so you can ignore it. (If you have XHTML, this is all a bit easier.)
If you have a start-tag, add the tag name in the regex group to the tags-to-close list. If you've got an end tag, take one tag off the end of the tags-to-close list (it should be the same tag name as was on there, otherwise you've got invalid markup. If there are no tags on the tags-to-close list, instead add the tag name to the tags-to-open list.
Once you've got to the end of the input string, prepend each of the tags-to-open tags to the string in reverse order, and append the close tags for the the tags-to-close to the end, again in reverse order.
(Yeah, I'm parsing HTML with regex. I think the nastiness of this demonstrates why you don't want to. If there's anything you can do to avoid having already snipped your markup in the middle of a tag, do that.)
Upvotes: 0