Reputation: 9982
I'm working on some code to determine the character encoding of an XML document being returned by a web server (an RSS feed in this particular case). Unfortunately, sometimes the web server lies and tells me that the document is UTF-8 when in fact it's not, or the boilerplate XML generation code on the server has <?xml encoding='UTF-8'?> at the start but the document contains invalid UTF-8 byte sequences.
Since I don't have control over the server, I need to make my client code tolerate this kind of inconsistency and show something, even if some of the characters are not decoded correctly. This is an important requirement for my application.
I'm well aware that the server is violating the XML spec in this case. I try to work with the server side developers when possible to make things correct according to the spec, but sometimes this is a low priority for them or for their organization, or the server side code is not actively maintained by anyone.
In order to be robust, I want to look at the first few bytes of the XML data and try to determine if it's some form of UTF-16 or some 8-bit encoding. I already have code that looks for a byte order mark (BOM).
But sometimes the server doesn't include a BOM, even for UTF-16. I want to try and figure out if it's UTF-16 or not by looking at the first two bytes and checking them against the list of possible first characters in an XML document.
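For illustration, here is a minimal Python sketch of that two-byte check. It assumes the first character is whitespace or '<', which is the candidate set considered in this question, and the function name `guess_utf16_without_bom` is hypothetical.

```python
from typing import Optional

# Characters assumed to be able to start a well-formed XML document
# (assumption taken from the question: space, tab, LF, CR, or '<').
FIRST_CHARS = b" \t\n\r<"


def guess_utf16_without_bom(data: bytes) -> Optional[str]:
    """Guess the UTF-16 byte order from the first two bytes, or return None."""
    if len(data) < 2:
        return None
    first, second = data[0], data[1]
    # Big-endian: a zero high byte followed by the ASCII character.
    if first == 0 and second in FIRST_CHARS:
        return "utf-16-be"
    # Little-endian: the ASCII character followed by a zero high byte.
    if second == 0 and first in FIRST_CHARS:
        return "utf-16-le"
    # No UTF-16 pattern seen: probably UTF-8 or some other 8-bit encoding.
    return None
```

A `None` result only means the two-byte pattern did not match; the BOM check would still need to run before this one.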
Obviously I have to draw the line somewhere. If the document is not well-formed XML I won't be able to parse it anyway unless I write my own very tolerant parser (which I'm not planning to do). But given that it's well-formed, what could I possibly see in the first character of the document aside from a BOM?
So far as I can tell from looking at the spec, this set would be: whitespace (space, tab, new line, carriage return) and '<'. Do any XML experts out there know of anything I might be missing? I need to assume that the <?xml?> declaration may not be present even if required by the spec.
Internal DTDs, processing instructions, tags and comments all start with '<'. Is it possible to have an entity (starting with '&') or something else at the start of a document?
EDIT: Rewritten to emphasize my particular requirements.
Upvotes: 1
Views: 943
Reputation: 12299
It's not ideal, but I sometimes do this when I need to cope with bad encodings (pseudo-code alert).
str = decode("utf-8", input)
if (!str) {
str = decode("cp1252", input)
}
That is, try to interpret the input as UTF-8, and if it fails, treat it as coming from a Windows system (which it probably is). It seems like a reasonable compromise to me.
Of course, this does require that you download the entire input into memory first, which may not be practical.
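As a rough Python rendering of the pseudo-code above (a sketch, not a definitive implementation): the name `decode_with_fallback` is hypothetical, and the use of `errors="replace"` on the fallback is my own assumption, added because Python's cp1252 codec leaves a few byte values undefined.

```python
def decode_with_fallback(data: bytes) -> str:
    """Decode as UTF-8 if the bytes are valid, otherwise as Windows-1252."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Python's cp1252 codec leaves a few bytes (e.g. 0x81) undefined,
        # so substitute U+FFFD for them instead of raising a second error.
        return data.decode("cp1252", errors="replace")
```

For example, `decode_with_fallback(b"caf\xe9")` falls through to the cp1252 branch, because a lone 0xE9 byte is not valid UTF-8.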
Upvotes: 0
Reputation: 59563
The XML Specification provides some guidance about detecting character encodings. The problem is that it is nearly impossible to look at the first few bytes and tell whether a document is UTF-8, ISO-8859-1, or CP437 for that matter. The information in the spec will at least let you tell the broad encoding families apart (UTF-16, UCS-4, ASCII-compatible) for well-formed documents.
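The spec's guidance is in its non-normative appendix on autodetection of character encodings, which lists byte signatures for the start of a document. A minimal Python sketch of that table follows; the names `SIGNATURES` and `sniff_xml_encoding` and the returned labels are hypothetical, and the signatures without a BOM assume the document really does start with an XML declaration.

```python
from typing import Optional

# (signature, encoding family) pairs. Four-byte UCS-4 patterns come first
# so they are tested before the shorter UTF-16 BOMs they overlap with.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UCS-4BE"),      # BOM
    (b"\xff\xfe\x00\x00", "UCS-4LE"),      # BOM
    (b"\x00\x00\x00<", "UCS-4BE"),         # '<' with no BOM
    (b"<\x00\x00\x00", "UCS-4LE"),
    (b"\x00<\x00?", "UTF-16BE"),           # '<?' with no BOM
    (b"<\x00?\x00", "UTF-16LE"),
    (b"<?xm", "ASCII-compatible"),         # read the declaration to refine
    (b"\x4c\x6f\xa7\x94", "EBCDIC"),       # '<?xm' in EBCDIC
    (b"\xef\xbb\xbf", "UTF-8"),            # BOM
    (b"\xfe\xff", "UTF-16BE"),             # BOM
    (b"\xff\xfe", "UTF-16LE"),             # BOM
]


def sniff_xml_encoding(prefix: bytes) -> Optional[str]:
    """Return the encoding family suggested by the first bytes, or None."""
    for signature, family in SIGNATURES:
        if prefix.startswith(signature):
            return family
    # No signature matched: the spec's default is UTF-8, but nothing
    # here proves that.
    return None
```

Note that "ASCII-compatible" covers UTF-8, ISO-8859-1, CP437 and many others, so this step narrows the family without settling the exact 8-bit encoding.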
Upvotes: 2
Reputation: 200826
The trouble is that if a feed is invalid, it probably doesn't obey any rules about legal characters. Take a look at the code for the Universal Feed Parser. It's very well-tested code for parsing garbage text into possibly-correct data structures.
The UFP uses a sub-library named Universal Encoding Detector, which should contain useful information for general encoding detection.
Upvotes: 1