daisydan

Reputation: 23

C#: How to parse non-standard XML

I have many long documents that need to be parsed. The format looks like XML, but it is not actually XML.

Here's an example:

<DOC>
    <TEXT>it's the content P&G</TEXT>
</DOC> 
<DOC>
    <TEXT>it's antoher</TEXT>
</DOC>

Note that there are multiple root tags (<DOC>), and the & character should be escaped as &amp; in XML.

So the file above is not well-formed XML.

Can I use XmlDocument to parse the file, or should I write my own parser?

Upvotes: 2

Views: 940

Answers (3)

Jon Egerton

Reputation: 41549

As @Oded says, this isn't an XML document - just some text.

However with some pre-parsing you might be able to convert it:

Wrap the whole thing in a new root node:

<DOCS>
    <DOC>
        <TEXT>it's the content P&G</TEXT>
    </DOC> 
    <DOC>
        <TEXT>it's antoher</TEXT>
    </DOC>
</DOCS>

Then search for the disallowed characters and replace them with their entities (e.g. &apos; and &amp;).

As pointed out in the comments, you should replace & first to avoid double encoding (i.e. ending up with &amp;apos;).

You might have to do this via string manipulation though, depending on where you're getting the data from.
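
The two steps above can be sketched in C# roughly as follows. This is only a minimal sketch, not a definitive implementation: the class name PseudoXml, the method Load, and the wrapper element DOCS are names made up for illustration, and the regex assumes that anything shaped like &name; or &#123; is already a valid reference. Only & is escaped here, because apostrophes are allowed as-is inside element content.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class PseudoXml
{
    // Matches an '&' that does not already start something shaped like an
    // entity reference (&name;) or a character reference (&#123; / &#x1F;).
    static readonly Regex BareAmpersand =
        new Regex(@"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Escapes bare ampersands, then wraps everything in a single root element.
    // "DOCS" is a hypothetical wrapper name, not something the input defines.
    public static XDocument Load(string raw)
    {
        string escaped = BareAmpersand.Replace(raw, "&amp;");
        return XDocument.Parse("<DOCS>" + escaped + "</DOCS>");
    }
}

static class Program
{
    static void Main()
    {
        string raw =
            "<DOC><TEXT>it's the content P&G</TEXT></DOC>\n" +
            "<DOC><TEXT>it's antoher</TEXT></DOC>";

        XDocument doc = PseudoXml.Load(raw);

        // Print the text of every TEXT element under every DOC element.
        foreach (XElement text in doc.Root.Elements("DOC").Elements("TEXT"))
            Console.WriteLine(text.Value);
    }
}
```

If the input can contain a literal < in the text, or something like &nbsp; that looks like an entity but is not defined in XML, this sketch will still throw an XmlException, so it would need extending for those cases.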

Upvotes: 2

oakio

Reputation: 1898

Yes, but you should set XmlReaderSettings.ConformanceLevel:

XmlReaderSettings settings = new XmlReaderSettings()
{
    ConformanceLevel = ConformanceLevel.Fragment
};
using (XmlReader reader = XmlReader.Create(stream, settings))
{
    //TODO: read here
}

More: http://msdn.microsoft.com/en-us/library/system.xml.xmlreadersettings.conformancelevel.aspx

Upvotes: 1

Oded

Reputation: 499002

What you are saying is somewhat incorrect - that this is "not standard XML". The document is not XML. Period.

You cannot use XmlDocument or any other XML parser to parse it as a complete document.

You need to ensure that you have valid XML before you try to parse it using an XML parser.

So, in this case, either wrap the document in a root element or break it out into several documents. In either case, you need to ensure that the special characters are encoded correctly (quotes, ampersands, etc.).

The answer by oakio gets you part way by treating the document as an XML fragment, but this still doesn't help with invalid content such as unescaped ampersands.
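
To illustrate that point, here is a rough sketch of reading the file as a fragment once the ampersands have been escaped. The names FragmentReader and ReadTexts are hypothetical, and the input is assumed to have been fixed up already (P&amp;G rather than P&G). Passing the raw text instead makes the reader throw an XmlException at the bare ampersand.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class FragmentReader
{
    // Collects the text of every TEXT element from input that has several
    // top-level elements. The input must already have its ampersands escaped.
    public static List<string> ReadTexts(string escapedFragments)
    {
        var settings = new XmlReaderSettings
        {
            ConformanceLevel = ConformanceLevel.Fragment
        };
        var texts = new List<string>();

        using (XmlReader reader = XmlReader.Create(new StringReader(escapedFragments), settings))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "TEXT")
                {
                    // Reads the content and moves past </TEXT>, so no extra Read() here.
                    texts.Add(reader.ReadElementContentAsString());
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return texts;
    }
}

static class Program
{
    static void Main()
    {
        string input =
            "<DOC><TEXT>it's the content P&amp;G</TEXT></DOC>\n" +
            "<DOC><TEXT>it's antoher</TEXT></DOC>";

        foreach (string text in FragmentReader.ReadTexts(input))
            Console.WriteLine(text);
    }
}
```

With fragment conformance no wrapper element is needed, but the escaping step is still required.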

Upvotes: 6
