Reputation: 965
I am looking for a way to beautify incomplete XML documents. Ideally it should handle even large documents (e.g. 10 MB or maybe 100 MB).
Incomplete means that the documents are truncated at a random position. Up to this position the XML is syntactically valid. Beautify means adding line breaks and leading spaces between the tags.
In my case this is needed to analyse aborted streams. Without line breaks and indentation they are really hard for a human to read. I know there are some editors that can beautify incomplete documents, but I want to integrate the beautifier into my own analysis tool.
Unfortunately I didn't find a discussion or solution for this case.
The NuGet package GuiLabs.Language.Xml by Kirill Osenkov (repository XmlParser) seems to be a useful candidate for implementing my own beautifier, because it is designed to be error tolerant. Unfortunately there is too little documentation to understand how to use this parser.
Example xml:
<?xml encoding="UTF-8"?><X><B><C>aa</C><B/><A.B><X>bb</X></A.B><A p="pp"/><nn:A>cc</nn:A><D><E>eee</
Expected result as string:
<?xml encoding="UTF-8"?>
<X>
<B>
<C>aa</C>
<B/>
<A.B>
<X>bb</X>
</A.B>
<A p="pp"/>
<nn:A>cc</nn:A>
<D>
<E>eee</
Upvotes: 0
Views: 539
Reputation: 167696
The error-tolerant XML parser of AngleSharp.Xml can parse your sample, although it adds the missing closing tags. You can then get an XML string representation of the built document and, with the help of the legacy XmlTextReader and XmlTextWriter, which allow you to ignore namespaces, at least indent the markup:
// Needs using System, System.IO and System.Xml, plus the AngleSharp.Xml namespaces.
var xml = @"<?xml encoding=""UTF-8""?><X><B><C>aa</C><B/><A.B><X>bb</X></A.B><A p=""pp""/><nn:A>cc</nn:A><D><E>eee</";
// Parse leniently; the parser closes any elements left open by the truncation.
var xmlParser = new XmlParser(new XmlParserOptions() { IsSuppressingErrors = true });
var doc = xmlParser.ParseDocument(xml);
Console.WriteLine(doc.ToMarkup());
using (StringReader sr = new StringReader(doc.ToXml()))
{
    using (XmlTextReader xr = new XmlTextReader(sr))
    {
        // Ignore namespaces so the undeclared prefix nn: is accepted.
        xr.Namespaces = false;
        using (XmlTextWriter xw = new XmlTextWriter(Console.Out))
        {
            xw.Namespaces = false;
            xw.Formatting = Formatting.Indented;
            xw.WriteNode(xr, false);
        }
    }
}
This gives, for example:
<X>
  <B>
    <C>aa</C>
    <B />
    <A.B>
      <X>bb</X>
    </A.B>
    <A p="pp" />
    <nn:A>cc</nn:A>
    <D>
      <E>eee</E>
    </D>
  </B>
</X>
As your text says "Until this position the XML has a valid syntax" and your comment suggests the errors in your sample are just due to sloppiness, I think it might also be possible to use WriteNode of an XmlWriter with XmlWriterSettings.Indent set to true on a standard XmlReader, as long as you catch the exception the XmlReader throws:
var xml = @"<?xml version=""1.0""?><root><section><p>Paragraph 1.</p><p>Paragraph 2.";
try
{
    using (StringReader sr = new StringReader(xml))
    {
        using (XmlReader xr = XmlReader.Create(sr))
        {
            using (XmlWriter xw = XmlWriter.Create(Console.Out, new XmlWriterSettings() { Indent = true }))
            {
                xw.WriteNode(xr, false);
            }
        }
    }
}
catch (XmlException e)
{
    Console.WriteLine();
    Console.WriteLine("Malformed input XML: {0}", e.Message);
}
gives
<?xml version="1.0"?>
<root>
  <section>
    <p>Paragraph 1.</p>
    <p>Paragraph 2.</p>
  </section>
</root>
Malformed input XML: Unexpected end of file has occurred. The following elements are not closed: p, section, root. Line 1, position 71.
So with WriteNode there is no need to handle every possible Readxxx and node type and call the corresponding Writexxx on the XmlWriter in your own code.
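Note that in the output above the writer appends the closing tags for p, section and root when it is disposed. If the output should instead stop where the input stops, as in the expected result in the question, one option is XmlWriterSettings.WriteEndDocumentOnClose. The following is only a minimal sketch. It assumes the input is well-formed and namespace-valid up to the truncation point, so an undeclared prefix such as nn: would still make the XmlReader throw early. The class and method names are hypothetical. It works on a TextReader and a TextWriter rather than on strings so that large inputs can be streamed from and to files.

```csharp
using System;
using System.IO;
using System.Xml;

public static class TruncatedXmlBeautifier
{
    // Hypothetical helper: copies the XML from input to output with indentation
    // and stops at the truncation point. Returns true if the input was complete.
    public static bool Beautify(TextReader input, TextWriter output)
    {
        var settings = new XmlWriterSettings
        {
            Indent = true,
            // Do not append closing tags for elements that are still open
            // when the writer is closed.
            WriteEndDocumentOnClose = false
        };

        using (XmlReader reader = XmlReader.Create(input))
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            try
            {
                // Copies every node the reader delivers to the writer.
                writer.WriteNode(reader, false);
                return true;
            }
            catch (XmlException)
            {
                // The reader hit the truncation point. Whatever was read before
                // is flushed to the output when the writer is disposed.
                return false;
            }
        }
    }

    public static void Main()
    {
        var xml = "<?xml version=\"1.0\"?><root><section><p>Paragraph 1.</p><p>Paragraph 2.";
        using (var input = new StringReader(xml))
        using (var output = new StringWriter())
        {
            bool complete = Beautify(input, output);
            Console.WriteLine(output.ToString());
            Console.WriteLine("Complete: {0}", complete);
        }
    }
}
```

For files, a StreamReader and a StreamWriter can be passed in place of the StringReader and StringWriter used in Main.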
Upvotes: 1
Reputation: 163488
Does it have to be C#?
In Java, you should be able to pipe the output of a SAX parser into an indenting serializer by connecting a SAXSource to a StreamResult using an identity transformer, and then just make sure that when the SAX parser aborts, you trap the exception and close the output stream tidily.
I think you can probably do the same thing in C# but not quite as conveniently: coupling the events read from an XmlReader and sending the corresponding events to an XmlWriter is a lot more tedious because you have to write code for each separate kind of event.
If you want a C# solution and you're prepared to install Saxon enterprise edition, you can write a simple streaming transformation:
<transform version="3.0" xmlns="http://www.w3.org/1999/XSL/Transform">
  <output method="xml" indent="yes"/>
  <mode streamable="yes" on-no-match="shallow-copy"/>
</transform>
invoke it from the Saxon API using XsltTransformer with a Serializer as the destination, and again, catch the exception and flush/close the output stream to which the Serializer is writing.
Using Saxon on Java would be overkill because the identity transformer does this "out of the box".
Upvotes: 0