simgineer

Reputation: 1888

Does XML declaration need to be in a specific encoding?

I'm troubleshooting a Weihenstephan server implementation and am having parsing issues with a commercial test client. I am wondering whether my XML declaration needs to be in a specific encoding.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

The odd thing is that the previous developer writes the XML to the TCP socket interleaving a zero byte with each character, which I assume is aiming at a Unicode/UTF-16 encoding; but in the generating code the encoding is set to UTF-8:

Transformer serializer = tf.newTransformer();
serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");

...

packetData[2 * i + 0] = data[i];
packetData[2 * i + 1] = 0;

Then the byte array packetData is sent:

dataOutputStream.write(packetData);
dataOutputStream.flush();
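To make the interleaving concrete, here is a self-contained sketch (class and variable names are illustrative, not from the actual server code). For pure ASCII input, appending a zero byte after each UTF-8 byte happens to produce the same bytes as UTF-16LE without a BOM, which is why the Wireshark dump below looks the way it does; for any non-ASCII character the trick would corrupt the multi-byte UTF-8 sequence.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class InterleaveDemo {
    public static void main(String[] args) {
        // UTF-8 bytes of an ASCII-only declaration (sample input, not the real document)
        byte[] data = "<?xml version=\"1.0\"?>".getBytes(StandardCharsets.UTF_8);

        // Interleave a zero byte after each byte, as in the question's code
        byte[] packetData = new byte[2 * data.length];
        for (int i = 0; i < data.length; i++) {
            packetData[2 * i] = data[i];
            packetData[2 * i + 1] = 0;
        }

        // For pure ASCII this matches UTF-16LE (without a BOM);
        // it breaks for any multi-byte UTF-8 sequence.
        byte[] utf16le = "<?xml version=\"1.0\"?>".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(Arrays.equals(packetData, utf16le)); // true for ASCII input
    }
}
```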

So in Wireshark the output looks like this:

.<.?.x.m.l. .v.e.r.s.i.o.n.=.".1...0.". .e.n.c.o.d.i.n.g.=.".U.T.F.-.8.". .s.t.a.n.d.a.l.o.n.e.=.".n.o.".?.>

I'm wondering whether the above is valid, and whether the XML declaration needs to be in a specific encoding (say UTF-8) with the rest of the document in whatever encoding the declaration specifies, or whether the whole document, declaration included, is simply in the encoding the declaration specifies.

Upvotes: 2

Views: 791

Answers (1)

Michael Kay

Reputation: 163322

An XML parser uses a variety of techniques to discover the encoding of the file. It may look for a byte order mark at the start, it may look for recognizable patterns in the initial bytes (e.g., what does "<?xml" look like in EBCDIC?) and it may assume that the initial bytes are in ASCII in which case it can read the encoding attribute in the XML declaration. Some of these things are prescribed by the spec, others are left implementation-defined.
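The detection techniques described above can be sketched roughly as follows (this is an illustrative fragment, not a full implementation of the autodetection algorithm in Appendix F of the XML spec; the method name is made up):

```java
import java.nio.charset.StandardCharsets;

public class EncodingSniffer {
    // Rough sketch of encoding sniffing: BOM first, then byte patterns,
    // then fall back to reading the declaration as ASCII-compatible text.
    static String sniff(byte[] b) {
        if (b.length >= 2 && b[0] == (byte) 0xFE && b[1] == (byte) 0xFF) return "UTF-16BE (BOM)";
        if (b.length >= 2 && b[0] == (byte) 0xFF && b[1] == (byte) 0xFE) return "UTF-16LE (BOM)";
        if (b.length >= 4 && b[0] == '<' && b[1] == 0 && b[2] == '?' && b[3] == 0) return "UTF-16LE (pattern)";
        if (b.length >= 4 && b[0] == 0 && b[1] == '<' && b[2] == 0 && b[3] == '?') return "UTF-16BE (pattern)";
        return "ASCII-compatible: read the encoding= attribute";
    }

    public static void main(String[] args) {
        byte[] utf16le = "<?xml version=\"1.0\"?>".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(sniff(utf16le)); // the question's bytes match the UTF-16LE pattern
    }
}
```

The question's zero-interleaved bytes hit the `<` `00` `?` `00` pattern, so a sniffing parser would conclude UTF-16LE before it ever reads the declaration that claims UTF-8.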

If two of these techniques give different answers, e.g. if the file is actually in UTF-16 but the XML declaration says it's in UTF-8, that doesn't technically make the XML ill-formed, but it does mean the parser may not be able to make head or tail of it.

Trying to manually generate UTF-16 by inserting zero bytes looks like a really bad idea.
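If UTF-16 output is really what's wanted, a safer approach is to let the serializer do the encoding, so the declaration and the actual bytes agree. A minimal sketch (the empty document here is just a stand-in for the real one):

```java
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class Utf16Serialize {
    public static void main(String[] args) throws Exception {
        // Stand-in document; the real code would serialize its own DOM
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        doc.appendChild(doc.createElement("root"));

        // Tell the serializer to emit UTF-16LE: it encodes the bytes and
        // writes a matching encoding= attribute in the declaration.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-16LE");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        // out.toByteArray() can then be written to the socket as-is
    }
}
```

That way there is no hand-rolled byte manipulation, and the declaration can never disagree with the stream's actual encoding.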

Upvotes: 3
