RealityDysfunction

Reputation: 2639

Add slash to self-closing tags

I need to convert a chunk of HTML that I obtain from a page into XML. Most of the tags convert fine when I load them into an XmlDocument, except for self-closing tags that are not closed (XmlDocument does not like those). Unfortunately I cannot fix them in the page itself, since it is generated by a third-party engine, so I have to add the slashes myself. I am not that great at regex, so I need some help with adding the "/" to one of these

Appreciate any input.

Upvotes: 1

Views: 1011

Answers (3)

John Koerner

Reputation: 38077

I would recommend using the HTML Agility Pack to parse it. The pack can write its parsed document out as XML and will close the tags for you (as well as handle CDATA wrapping and other tricky problems you may run into). For example, this is how you can convert HTML to XML:

using System.Diagnostics;
using System.IO;
using System.Xml;

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();

// Sample input: <img> and <body> are never closed.
string html = "<HTML><body><a href ='something'> <img src='a.jpg'></a></HTML>";

doc.LoadHtml(html);
MemoryStream ms = new MemoryStream();
XmlWriter xml = XmlWriter.Create(ms);
doc.OptionOutputAsXml = true;   // write XML instead of HTML
doc.Save(xml);

// Rewind the stream and dump what was written.
ms.Position = 0;
StreamReader sr = new StreamReader(ms);
Debug.WriteLine(sr.ReadToEnd());

Which renders the output:

<?xml version="1.0" encoding="iso-8859-1"?><html><body><a href="something"> <img src="a.jpg" /></a></body></html>

Upvotes: 4

bjhamltn

Reputation: 410

For non-standard tags, you may have to add the tag name to HtmlAgilityPack.HtmlNode.ElementsFlags so that the parser treats it as an empty element.

For example: HtmlAgilityPack.HtmlNode.ElementsFlags.Add("spanspec", HtmlElementFlag.Empty);
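
A minimal sketch of how that registration might fit into an HTML-to-XML conversion. The tag name spanspec comes from this answer; the sample markup and the ToXml helper are made up for illustration. The indexer is used instead of Add only so that running the registration twice does not throw on a duplicate key. No output is listed because it may differ between Html Agility Pack versions.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

static class CustomEmptyTagDemo
{
    // Hypothetical helper: loads an HTML fragment and returns it as XML text.
    static string ToXml(string html)
    {
        var doc = new HtmlDocument();
        doc.OptionOutputAsXml = true;   // ask the pack to emit XML
        doc.LoadHtml(html);

        using (var writer = new StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }

    static void Main()
    {
        // Register the non-standard tag as an empty element BEFORE parsing.
        HtmlNode.ElementsFlags["spanspec"] = HtmlElementFlag.Empty;

        // Made-up sample: <spanspec> is never closed in the source.
        Console.WriteLine(ToXml("<div><spanspec width='10'>text</div>"));
    }
}
```

ElementsFlags is static, so the registration applies to every document parsed afterwards in the same process.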

Upvotes: 0

Mitch

Reputation: 22251

HTML is not XML. Don't try. It won't work. Even if it works now, it won't tomorrow. If you'd like an example, try parsing the following as XML, even though it is perfectly valid HTML:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd"> 
<HTML/
  <HEAD/
    <TITLE/>/
     <P/>

Use an HTML parser; I can recommend the HTML Agility Pack.
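
To see the mismatch on something much smaller than the document above, here is a rough sketch that uses only System.Xml. The IsWellFormedXml helper and the sample strings are mine, not from the question.

```csharp
using System;
using System.Xml;

static class XmlCheck
{
    // Hypothetical helper: true if XmlDocument can load the markup as XML.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            new XmlDocument().LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    static void Main()
    {
        // Unclosed <img>: fine as HTML, rejected as XML.
        Console.WriteLine(IsWellFormedXml("<p><img src='a.jpg'></p>"));   // False

        // Same markup with the slash added.
        Console.WriteLine(IsWellFormedXml("<p><img src='a.jpg' /></p>")); // True

        // Attribute without a value: the slash alone would not help here.
        Console.WriteLine(IsWellFormedXml("<p><input disabled></p>"));    // False
    }
}
```

The last case is why a regex that only appends "/" tends to fall short: unclosed tags are just one of several ways HTML departs from XML.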

Upvotes: 0
