XmlTextWriter default encoding behaves differently to setting encoding to UTF-8

Question

I am seeing some behaviour I don't expect with XmlTextWriter. When I specify the encoding while instantiating the writer, using either

new XmlTextWriter(fs, Encoding.UTF8)

or

XmlWriter.Create(fs, new XmlWriterSettings(){Encoding = Encoding.UTF8} )

the document produced starts with a leading hex character. The C++ parser I am passing the XML to cannot read this character, so I want to avoid it. Interestingly, when I create the writer like this

new XmlTextWriter(fs, null)

I get exactly the behaviour I expect. How do I recreate this instantiation in code without leaving the parameter null?

groverboy · Accepted Answer

I think the "leading hex character" is a byte order mark (BOM) as I commented on your question, though I can't be sure without actually seeing it. The C++ parser seems not to know about BOMs, which is odd (see standard reference by Joel Spolsky).

Let's assume that the C++ parser works only with XML encoded as UTF-8 (or as ASCII, which is a byte-compatible subset of it). In that case you have no option but to encode as UTF-8 and exclude the BOM. XmlWriter lets you do so as follows:

var utf8NoBom = new UTF8Encoding(false);
var writer = XmlWriter.Create(fs, new XmlWriterSettings() { Encoding = utf8NoBom } );

The quote below is from the MSDN reference on XmlWriter.Create:

XmlWriter always writes a Byte Order Mark (BOM) to the underlying data stream; however, some streams must not have a BOM. To omit the BOM, create a new XmlWriterSettings object and set the Encoding property to be a new UTF8Encoding object with the Boolean value in the constructor set to false.
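
If you want to confirm this on your side, here is a minimal sketch that writes a tiny document to a MemoryStream with each encoding and checks whether the output starts with the UTF-8 BOM bytes (EF BB BF). The class name BomCheck, its helper methods, and the "root" element are made up for illustration; I am assuming a MemoryStream behaves like your fs here.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class BomCheck
{
    // Serializes a one-element document with the given encoding
    // and returns the raw bytes that reached the stream.
    public static byte[] WriteSample(Encoding encoding)
    {
        using (var ms = new MemoryStream())
        {
            var settings = new XmlWriterSettings { Encoding = encoding };
            using (var writer = XmlWriter.Create(ms, settings))
            {
                writer.WriteStartElement("root"); // hypothetical element name
                writer.WriteEndElement();
            }
            return ms.ToArray();
        }
    }

    // True if the bytes begin with the UTF-8 byte order mark (EF BB BF).
    public static bool StartsWithUtf8Bom(byte[] bytes)
    {
        return bytes.Length >= 3
            && bytes[0] == 0xEF
            && bytes[1] == 0xBB
            && bytes[2] == 0xBF;
    }

    public static void Main()
    {
        // Encoding.UTF8 carries a preamble; new UTF8Encoding(false) does not.
        Console.WriteLine(StartsWithUtf8Bom(WriteSample(Encoding.UTF8)));
        Console.WriteLine(StartsWithUtf8Bom(WriteSample(new UTF8Encoding(false))));
    }
}
```

If the first check reports a BOM and the second does not, the "leading hex character" is almost certainly the BOM, and switching to new UTF8Encoding(false) should be enough for your parser.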

EDIT: If the C++ parser is a general-purpose XML parser then its ignorance of BOMs is odd. If the parser is domain-specific, i.e. if it is always used with files whose character encoding is known (and obviously limited), then its ignorance is not odd. I think this is Spolsky's point.
