CodeBuddy

Reputation: 6639

Is charset=unicode UTF-8, UTF-16 or something else?

When encountering an HTML document with the following Content-Type:

text/html; charset=unicode

How should this be read?

I'm currently passing the value of the charset as the second argument to the InputStreamReader constructor in Java, e.g.:

inputStreamReader = new InputStreamReader(inputStream, charset);

This seems to read the document as UTF-16. Is this correct? I've not been able to find any documentation for the charset name 'unicode' that says how it should be interpreted.

Upvotes: 3

Views: 10835

Answers (3)

geogan

Reputation: 85

When you export a document from Microsoft Word as HTML and look at the output, it generates:

<meta http-equiv=Content-Type content="text/html; charset=unicode">

I found this because I had to produce HTML that would open and display correctly in MS Word in Dutch. When I used:

<meta http-equiv=Content-Type content="text/html; charset=utf-8">

MS Word opened the document with incorrect characters (the ë showed as a strange Chinese symbol). When I changed my HTML to say "unicode" instead of "utf-8", MS Word opened it and showed the correct Dutch characters.

So is MS Word once again doing things wrong? I don't know, but that's what I have to output for it to work.

Upvotes: 1

Joop Eggen

Reputation: 109547

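Unicode is a numbering standard that assigns a code point to every character (fewer than 2^21 code points in total). There are several byte formats for those code points: UTF-8 (variable length, 1 to 4 bytes per character), UTF-16LE or UTF-16BE (2-byte units, with two units for characters outside the Basic Multilingual Plane), and others.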

The header you saw is wrong: "unicode" on its own does not name any one of these byte formats.
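
To make the difference between the byte formats concrete, here is a minimal sketch that encodes a few sample strings and reports how many bytes each one takes. The class and method names (EncodingSizes, utf8Length, utf16Length) are made up for illustration; only the standard String.getBytes and StandardCharsets calls are relied on.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the text occupies when encoded as UTF-16BE
    // (this charset does not write a byte order mark).
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // Samples: ASCII "e", U+00EB (e with diaeresis), U+20AC (euro sign),
        // and U+1F600, which lies outside the Basic Multilingual Plane.
        String[] samples = {"e", "\u00EB", "\u20AC", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.println(s + ": UTF-8 = " + utf8Length(s)
                    + " bytes, UTF-16BE = " + utf16Length(s) + " bytes");
        }
    }
}
```

The same character therefore produces different byte sequences depending on the format, which is why a reader has to know exactly which one was used.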

Upvotes: 0

Iłya Bursov

Reputation: 24146

Actually, this header is wrong: there is no such charset as "unicode".

According to "Setting the HTTP charset parameter", the charset value can be:

any token that has a predefined value within the IANA Character Set

These are the official names for character sets that may be used in the Internet and may be referred to in Internet documentation

So you need to either tell the developers of this service to fix the error, or check the actual content and only then treat it as UTF-7, UTF-8 or UTF-16.
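
One way to "check the actual content" is to look for a byte order mark (BOM) at the start of the stream. The following is only a sketch under assumptions: the class name BomSniffer and the method detect are hypothetical, the caller supplies whatever fallback charset it considers reasonable, and UTF-7 and UTF-32 are not handled (a UTF-32LE BOM would be mistaken for UTF-16LE here).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {
    // Returns the charset indicated by a leading byte order mark,
    // or the supplied fallback when no known BOM is present.
    static Charset detect(byte[] head, Charset fallback) {
        // UTF-8 BOM: EF BB BF
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            // UTF-16 big-endian BOM: FE FF
            if (b0 == 0xFE && b1 == 0xFF) return StandardCharsets.UTF_16BE;
            // UTF-16 little-endian BOM: FF FE
            if (b0 == 0xFF && b1 == 0xFE) return StandardCharsets.UTF_16LE;
        }
        return fallback;
    }

    public static void main(String[] args) {
        // BOM followed by "<" encoded as UTF-16LE.
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00};
        System.out.println(detect(sample, StandardCharsets.ISO_8859_1));
    }
}
```

In practice the caller would read the first few bytes of the stream, for example through a BufferedInputStream with mark and reset, and then pass the detected charset to InputStreamReader. Note that Java's UTF-8, UTF-16LE and UTF-16BE decoders do not strip the BOM, so the first decoded character may be U+FEFF and may need to be skipped.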

Upvotes: 5
