Rostam N.

Reputation: 15

Which encoding to use for many international languages

I am setting up a little website and would like to make it international. All the content will be stored in an external xml in different languages and parsed into the html via javascript.

Now the problem is, there are also german umlauts, russian, chinese and japanese symbols and also right-to-left languages like arabic and farsi.

What would be the best way/solution? Is there an "international encoding" which can display all languages properly? Or is there any other solution you would suggest?

Thanks in advance!

Upvotes: 1

Views: 2586

Answers (2)

Stephen C

Reputation: 718708

The normal (and recommended) solution for multi-lingual sites is to use UTF-8. That can deal with any characters that have been assigned Unicode codepoints, with a couple of caveats:

  1. Unicode is a versioned standard, and different Javascript implementations may support different Unicode versions.

  2. If your text includes characters outside the Unicode Basic Multilingual Plane (BMP), then you need to do your text processing (in Javascript) in a way that is Unicode aware. For instance, if you use the Javascript String class, you need to take proper account of surrogate pairs when doing text manipulation.

(A Javascript String is actually encoded as UTF-16. While it has methods that allow you to manipulate it as Unicode codepoints, methods and attributes such as substring and length use code-unit rather than codepoint indexing. If you are not careful, you can end up splitting a string between the high and low parts of a surrogate pair, and the result will be something that cannot be displayed properly. This only affects codepoints in higher planes ... but that includes the new emoji codepoints.)
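For example, here is a minimal sketch in plain Javascript (the string and indexes are arbitrary) showing the difference between code-unit and codepoint indexing:

const s = "a😎b";

// .length counts UTF-16 code units, not codepoints
console.log(s.length);      // 4, not 3

// Slicing by code units can cut a surrogate pair in half
console.log(s.slice(0, 2)); // "a\uD83D" (an unpaired high surrogate)

// Codepoint-aware alternatives
console.log([...s].length);    // 3 (the string iterator yields whole codepoints)
console.log(s.codePointAt(1)); // 128526 (the full codepoint, U+1F60E)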

Upvotes: 1

T.J. Crowder

Reputation: 1074028

All of the Unicode transformation formats (UTF-8, UTF-16, UTF-32) can encode all Unicode characters. You pick which one to use based on size: if most of your text is in western scripts, probably UTF-8, as it uses only one byte for most of those characters, expanding to 2, 3, or 4 bytes when needed. If you're encoding East Asian scripts, UTF-16 is usually more compact, since most CJK characters take two bytes in UTF-16 but three in UTF-8.
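To make that concrete, here is a minimal sketch (plain browser or Node JavaScript; byteSizes is a made-up helper name). The standard TextEncoder only emits UTF-8, so the UTF-16 and UTF-32 sizes are computed by hand from the code-unit and codepoint counts:

function byteSizes(s) {
  return {
    utf8:  new TextEncoder().encode(s).length, // TextEncoder always produces UTF-8
    utf16: s.length * 2,                       // 2 bytes per UTF-16 code unit
    utf32: [...s].length * 4                   // 4 bytes per codepoint
  };
}

console.log(byteSizes("hello")); // { utf8: 5, utf16: 10, utf32: 20 }
console.log(byteSizes("你好"));   // { utf8: 6, utf16: 4,  utf32: 8 }

Note how the CJK string is smaller in UTF-16 than in UTF-8; that is the trade-off described above.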

The fundamental thing here is that it's all Unicode; the transformations are just different ways of representing the same characters.

The co-founder of Stack Overflow wrote a good article on this topic: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)


Regardless of what encoding you use for your document, note that if you're processing these strings in JavaScript, JavaScript strings are UTF-16 (except that invalid values are tolerated), even if the document itself is in UTF-8 or UTF-32. This means that each of those emojis people are so excited about these days looks like two "characters" to JavaScript, because it takes two words of UTF-16 to represent. Take 😎, for instance:

console.log("😎".length); // 2

So you'll need to be careful not to split up the two halves of characters that are encoded in two words of UTF-16.
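One way to do that, as a minimal sketch (truncateCodepoints is a made-up helper name, not a built-in): iterate the string, which yields whole codepoints rather than UTF-16 code units:

// Hypothetical helper: keep the first n codepoints without
// splitting a surrogate pair. The string iterator walks
// whole codepoints, so an emoji is never cut in half.
function truncateCodepoints(s, n) {
  return [...s].slice(0, n).join("");
}

console.log("😎!".slice(0, 1));            // "\uD83D" (an unpaired high surrogate)
console.log(truncateCodepoints("😎!", 1)); // "😎" (intact)

This keeps codepoints intact, though multi-codepoint sequences (combining marks, ZWJ emoji sequences) would still need grapheme-aware handling.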

Upvotes: 3
