Reputation: 11
I am tormented by the question concerning the usage of Unicode for a long time. Unicode allows to accelerate and simplify the development of software (in terms of globalization), but I am concerned by the following factors:
With the first paragraph of all it is obvious... But I don't know the true or not the others. Is there anyone who is faced with the need to localize software for Asian countries, and is ready to share the experience?
At the moment I try to use the encoding of a narrow profile (cp1251 - for Russia, cp1254 for Turkey, etc.). Will somebody advice on this issue?
Upvotes: 0
Views: 107
Reputation: 19104
Increased text size, and all of the following are actually untrue.
They may be true, for old-school encodings of unicode, such as UTF-16. UTF-8 is not larger, or slower than ASCII for ASCII-only strings, and yet it allows encoding every Unicode code point. UTF-8 is also a de-facto standard of doing Unicode on the marketplace today.
There is an extensive analysis of performance of different Unicode encodings in http://www.utf8everywhere.org, including for the Asian languages.
Upvotes: 0
Reputation: 201528
Check out the official Unicode FAQ. It has a lot to say about issues like these.
Upvotes: 1
Reputation: 8511
Increased text size: Yes. Text size may be increased up to 6 times (for UTF-8). But storage for texts nowadays is nothing a big problem.
Reduction of text processing performance: As per my opinion, no. An UTF-8 character may take up to 6 bytes, but when scanning thru' the text, and right at the first byte of an UTF-8 character we already know how many bytes more for to read for it (the current character in scanning). So most likely the scanning performance stays the same as O(n), where 'n' is the length of the text. To keep the best performance, try not to access the characters in a text by index (yeah, this is a down-point for performance). Java string is not effected by random index access to string character because Java string is a series of 2-byte characters.
Asian languages treated all alike to the detriment of the national specificities: Yeah, human languages when presented in text format are all alike, but a letter 'i' of a single stroke or a letter of '長' of 16 strokes is just a character.
Upvotes: 0
Reputation: 522005
The first two points are very much negligible. You'd need to have a very specific use case where the difference in size and performance make a discernible difference that justifies the headaches of mixed encodings.
Regarding the Unihan characters: They are grouped by meaning of the character, but that character may be written slightly differently in different writing systems. This is a problem of properly marking up the language, it's not really an encoding problem. In HTML documents, you can mark the document with lang
attributes and/or set specific fonts using CSS which will alter the appearance of the character for the language appropriately. How to handle this correctly depends on the type of software (HTML, desktop app, etc). I'd advise you open a new, detailed question about that.
Upvotes: 0