Reputation: 23
I have many long documents that need to be parsed. The format looks like XML but is not actually XML.
Here's an example:
<DOC>
<TEXT>it's the content P&G</TEXT>
</DOC>
<DOC>
<TEXT>it's antoher</TEXT>
</DOC>
Note that there are multiple root tags (<DOC>), and the bare & would have to be written as &amp; in XML.
Thus, the above file is not standard XML.
Can I use XmlDocument to parse the file, or should I write my own parser?
Upvotes: 2
Views: 940
Reputation: 41549
As @Oded says, this isn't an XML document - just some text.
However with some pre-parsing you might be able to convert it:
Wrap the whole thing in a new root node:
<DOCS>
<DOC>
<TEXT>it's the content P&G</TEXT>
</DOC>
<DOC>
<TEXT>it's antoher</TEXT>
</DOC>
</DOCS>
Then search for the disallowed characters and replace them with their entities (e.g. &apos; and &amp;).
As pointed out in the comments, you should replace & first to avoid double encoding (i.e. ending up with &amp;apos;).
You might have to do this via string manipulation though, depending on where you're getting the data from.
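The two steps above (escape, then wrap) could be sketched roughly as follows. This is only an illustration under some assumptions: the class name PseudoXmlSketch and the method LoadWrapped are hypothetical, the whole input is assumed to fit in memory as a string, and only bare ampersands are repaired. A stray < inside the text would still break the parse.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

public static class PseudoXmlSketch
{
    // Matches an ampersand that does not already start an entity or a
    // character reference, so existing "&amp;" is not double-encoded.
    private static readonly Regex BareAmpersand =
        new Regex(@"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)");

    // Hypothetical helper: escape bare ampersands, wrap everything in a
    // single <DOCS> root, then hand the result to XmlDocument.
    public static XmlDocument LoadWrapped(string raw)
    {
        string escaped = BareAmpersand.Replace(raw, "&amp;");
        var doc = new XmlDocument();
        doc.LoadXml("<DOCS>" + escaped + "</DOCS>");
        return doc;
    }

    public static void Main()
    {
        string raw =
            "<DOC>\n<TEXT>it's the content P&G</TEXT>\n</DOC>\n" +
            "<DOC>\n<TEXT>it's antoher</TEXT>\n</DOC>";

        XmlDocument doc = LoadWrapped(raw);
        foreach (XmlNode text in doc.SelectNodes("/DOCS/DOC/TEXT"))
        {
            // Prints "it's the content P&G", then "it's antoher".
            Console.WriteLine(text.InnerText);
        }
    }
}
```

The negative lookahead is what handles the double-encoding concern: an ampersand that already begins a known entity is left alone, so running the helper twice gives the same result.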
Upvotes: 2
Reputation: 1898
Yes, but you should set XmlReaderSettings.ConformanceLevel:
XmlReaderSettings settings = new XmlReaderSettings()
{
    ConformanceLevel = ConformanceLevel.Fragment
};
using (XmlReader reader = XmlReader.Create(stream, settings))
{
    //TODO: read here
}
More: http://msdn.microsoft.com/en-us/library/system.xml.xmlreadersettings.conformancelevel.aspx
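To make the "read here" part concrete, a reading loop might look something like the sketch below. It assumes the input is held in a string rather than a stream, that the ampersands are already escaped (fragment mode does not fix them), and that only the TEXT elements are of interest. The names FragmentReaderSketch and ReadTexts are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class FragmentReaderSketch
{
    // Hypothetical helper: collect the content of every <TEXT> element
    // from input that may have several top-level elements.
    public static List<string> ReadTexts(string fragment)
    {
        var settings = new XmlReaderSettings
        {
            ConformanceLevel = ConformanceLevel.Fragment
        };

        var texts = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(fragment), settings))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "TEXT")
                {
                    // ReadElementContentAsString already moves past </TEXT>,
                    // so no extra Read() is needed on this branch.
                    texts.Add(reader.ReadElementContentAsString());
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return texts;
    }

    public static void Main()
    {
        string fragment =
            "<DOC>\n<TEXT>it's the content P&amp;G</TEXT>\n</DOC>\n" +
            "<DOC>\n<TEXT>it's antoher</TEXT>\n</DOC>";

        foreach (string text in ReadTexts(fragment))
        {
            Console.WriteLine(text);
        }
    }
}
```

Because XmlReader is forward-only, this approach does not need the whole input in memory when it is fed from a stream, which may matter for long documents.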
Upvotes: 1
Reputation: 499002
Calling this "not standard XML" is somewhat incorrect. The document is not XML. Period.
You cannot use XmlDocument or any other XML parser to parse it as a complete document.
You need to ensure that you have valid XML before you try to parse it using an XML parser.
So, in this case, either wrap the document in a root element or break it out into several documents. In either case, you need to ensure that the special characters are encoded correctly (quotes, ampersands, etc.).
The answer by oakio gets you part way by treating the document as an XML fragment, but this still doesn't help with invalid content such as unescaped ampersands.
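A quick way to see that limitation is a small check like the one below. It is a sketch only: FragmentLimitSketch and ParsesAsFragment are hypothetical names, and the inputs are shortened versions of the question's sample.

```csharp
using System;
using System.IO;
using System.Xml;

public static class FragmentLimitSketch
{
    // Hypothetical helper: returns true if the input can be read to the end
    // in fragment mode, false if the reader throws an XmlException.
    public static bool ParsesAsFragment(string input)
    {
        var settings = new XmlReaderSettings
        {
            ConformanceLevel = ConformanceLevel.Fragment
        };

        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(input), settings))
            {
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // Several root elements are fine in fragment mode.
        Console.WriteLine(ParsesAsFragment(
            "<DOC><TEXT>P&amp;G</TEXT></DOC><DOC><TEXT>x</TEXT></DOC>")); // True

        // A bare ampersand is still rejected.
        Console.WriteLine(ParsesAsFragment(
            "<DOC><TEXT>P&G</TEXT></DOC>")); // False
    }
}
```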
Upvotes: 6