Reputation: 6639
When encountering an HTML document with the following Content-Type:
text/html; charset=unicode
How should this be read?
I'm currently passing the value of the charset as the second argument to InputStreamReader's constructor in Java, e.g.:
inputStreamReader = new InputStreamReader(inputStream, charset);
This seems to read the document as UTF-16. Is this correct? I've not been able to find any documentation about the charset name 'unicode', so I don't know what the right behaviour is.
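One way to see what the JVM does with this name is to ask `Charset` directly. The following is only a minimal sketch: `CharsetLookup` and `canonicalName` are illustrative names, and whether "unicode" is accepted as an alias (and what it maps to) depends on the JVM in use, so the output should be checked rather than assumed.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

class CharsetLookup {
    // Returns the canonical name the JVM maps the label to,
    // or null if the label is malformed or not supported.
    static String canonicalName(String label) {
        try {
            return Charset.forName(label).name();
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // The result for "unicode" is JVM-dependent; print it instead of assuming it.
        System.out.println("unicode -> " + canonicalName("unicode"));
        System.out.println("utf-8   -> " + canonicalName("utf-8"));
    }
}
```

If the first line prints a UTF-16 name, that would match the behaviour described above.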
Upvotes: 3
Views: 10835
Reputation: 85
Actually, when you export a document from Microsoft Word as HTML and look at the result, it generates:
<meta http-equiv=Content-Type content="text/html; charset=unicode">
The reason I found this is that I had to produce HTML that would open in MS Word and display Dutch text correctly, and when I used:
<meta http-equiv=Content-Type content="text/html; charset=utf-8">
MS Word would open the document with incorrect characters (the ë would show up as a strange Chinese-looking character). When I changed my HTML to say "unicode" instead of "utf-8", MS Word opened it and showed the correct Dutch characters.
So is MS Word once again doing things wrong? I don't know, but that's what I have to output for it to work.
Upvotes: 1
Reputation: 109547
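Unicode is a numbering standard for characters (every code point is below 2^21); it is not itself a byte format. There are several byte encodings of it: UTF-8 (variable length, 1 to 4 bytes per character), UTF-16LE or UTF-16BE (2-byte units, with some characters needing two units), and others such as UTF-32.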
What you saw was wrong: "unicode" names the standard, not a specific byte encoding.
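To make the difference between these byte formats concrete, here is a small sketch that encodes the single character ë (U+00EB) in three of them. `EncodingDemo` and `hex` are illustrative names; only standard `java.nio.charset` classes are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Encodes the text with the given charset and renders the bytes
    // as space-separated upper-case hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00EB"; // ë
        System.out.println("UTF-8:    " + hex(s, StandardCharsets.UTF_8));    // C3 AB
        System.out.println("UTF-16LE: " + hex(s, StandardCharsets.UTF_16LE)); // EB 00
        System.out.println("UTF-16BE: " + hex(s, StandardCharsets.UTF_16BE)); // 00 EB
    }
}
```

The same character produces different bytes in each format, so a reader that assumes the wrong one will decode the two UTF-8 bytes as a single, unrelated 16-bit character.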
Upvotes: 0
Reputation: 24146
Actually, this header is wrong: there is no such charset as "unicode".
According to "Setting the HTTP charset parameter":
any token that has a predefined value within the IANA Character Set
So you need to either ask the developers of this service to fix the error, or inspect the actual content and only then treat it as UTF-7, UTF-8, or UTF-16.
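As a rough sketch of what "inspect the actual content" could look like, the code below checks the first bytes for a byte order mark (BOM). `BomSniffer` and `fromBom` are hypothetical names, not part of any library. It only covers UTF-8 and UTF-16; UTF-7 and UTF-32 are not handled, and content without a BOM returns null, so a real reader would still need a fallback.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomSniffer {
    // Returns the charset implied by a leading byte order mark,
    // or null if the bytes do not start with a recognised BOM.
    static Charset fromBom(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null; // no BOM: the caller has to decide on a fallback
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00}; // BOM followed by '<' in UTF-16LE
        System.out.println(fromBom(sample));
    }
}
```

The detected charset can then be passed to `new InputStreamReader(inputStream, charset)` in place of the value taken from the header.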
Upvotes: 5