user1717043

Reputation: 11

Unicode usage in software

The question of using Unicode has been bothering me for a long time. Unicode can accelerate and simplify software development (in terms of globalization), but I am concerned about the following factors:

  1. increased memory and disk space usage;
  2. reduced text-processing performance;
  3. Asian languages being treated all alike, to the detriment of national specificities.

The first point is obvious, but I don't know whether the others are true or not. Has anyone faced the need to localize software for Asian countries and is willing to share their experience?

At the moment I try to use narrow, locale-specific encodings (cp1251 for Russia, cp1254 for Turkey, etc.). Can anybody advise on this issue?
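For illustration only (an editorial sketch, not part of the original question; the sample strings are assumptions): in Java, a narrow code page such as windows-1251 covers Russian text but cannot represent Turkish letters, which is the trade-off the question is weighing against Unicode.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class CodePageDemo {
    public static void main(String[] args) {
        // windows-1251 is a single-byte Cyrillic code page (the "cp1251" above)
        CharsetEncoder cp1251 = Charset.forName("windows-1251").newEncoder();

        String russian = "Привет";   // representable in cp1251
        String turkish = "Günaydın"; // contains letters cp1251 cannot encode

        System.out.println(cp1251.canEncode(russian)); // true
        System.out.println(cp1251.canEncode(turkish)); // false

        // UTF-8 can represent both, at the cost of multi-byte sequences:
        // each of the 6 Cyrillic letters takes 2 bytes here.
        System.out.println(russian.getBytes(StandardCharsets.UTF_8).length); // 12
    }
}
```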

Upvotes: 0

Views: 107

Answers (4)

Pavel Radzivilovsky

Reputation: 19104

Increased text size, and all the points that follow, are actually untrue.

They may be true for old-school Unicode encodings such as UTF-16. UTF-8 is neither larger nor slower than ASCII for ASCII-only strings, yet it allows encoding every Unicode code point. UTF-8 is also the de-facto standard for doing Unicode in the marketplace today.
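To illustrate the size claim (an editorial sketch, not part of the original answer; the sample string is an assumption): an ASCII-only string encodes to exactly the same bytes under US-ASCII and UTF-8, because UTF-8 is a superset of ASCII.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiUtf8Demo {
    public static void main(String[] args) {
        String ascii = "Hello, world!"; // ASCII-only sample text

        byte[] asAscii = ascii.getBytes(StandardCharsets.US_ASCII);
        byte[] asUtf8  = ascii.getBytes(StandardCharsets.UTF_8);

        // UTF-8 encodes U+0000..U+007F as single bytes identical to ASCII,
        // so the two arrays have the same length and content.
        System.out.println(Arrays.equals(asAscii, asUtf8));            // true
        System.out.println(asAscii.length + " == " + asUtf8.length);   // 13 == 13
    }
}
```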

There is an extensive analysis of the performance of different Unicode encodings at http://www.utf8everywhere.org, including for the Asian languages.

Upvotes: 0

Jukka K. Korpela

Reputation: 201528

  1. The impact on data size in bytes depends on the choice of Unicode encoding and on the type of data. For example, using UTF-8 (the only useful Unicode encoding on the web), English text has the same size as in 8-bit encodings, except for typographically correct punctuation marks, which may take three bytes each; in Turkish text, any non-ASCII letter is 2 bytes instead of 1; in Russian text, any Cyrillic letter is 2 bytes. In most cases, this does not matter much (see the sketch after this list).
  2. Text-processing performance depends on what you do and how you do it. The reasonable expectation is that there is no problem worth worrying about: if processing is fast enough, it hardly matters whether it would be 10% faster in an 8-bit encoding.
  3. Unicode unification has its impact, but Asian languages are certainly not treated all alike. The Unicode standard has a lot to say about the specific treatment of characters in Asian scripts and languages. If you are referring to the different shapes of CJK characters in different languages, the usual solution is to use fonts designed for the language in question. (In principle it can also be handled within a single font, when OpenType fonts are used.)
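As an editorial illustration of point 1 (not part of the original answer; the sample strings are assumptions), the byte counts can be compared directly in Java:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SizeComparison {
    public static void main(String[] args) {
        String russian = "русский текст"; // 12 Cyrillic letters plus a space
        String turkish = "ağaç";          // two ASCII and two non-ASCII letters

        // Legacy single-byte code pages: one byte per character.
        // (The windows-125x charsets are present in typical JREs,
        // though not guaranteed by the Java specification.)
        System.out.println(russian.getBytes(Charset.forName("windows-1251")).length); // 13
        System.out.println(turkish.getBytes(Charset.forName("windows-1254")).length); // 4

        // UTF-8: Cyrillic and Turkish non-ASCII letters take 2 bytes;
        // ASCII letters and the space stay at 1 byte.
        System.out.println(russian.getBytes(StandardCharsets.UTF_8).length); // 25
        System.out.println(turkish.getBytes(StandardCharsets.UTF_8).length); // 6
    }
}
```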

Check out the official Unicode FAQ. It has a lot to say about issues like these.

Upvotes: 1

jondinham

Reputation: 8511

  1. Increased text size: yes. A character may take up to 4 bytes in UTF-8 (the original design allowed sequences of up to 6 bytes, but UTF-8 has since been restricted to 4). Still, storage for text is hardly a big problem nowadays.

  2. Reduced text-processing performance: in my opinion, no. A UTF-8 character may take up to 4 bytes, but when scanning through the text, the first byte of each UTF-8 sequence already tells us how many more bytes to read for the current character (see the sketch after this list). So scanning performance stays at O(n), where n is the length of the text. To keep the best performance, avoid accessing characters in a text by index (yes, that is a weak point for performance). Java strings are not affected by random index access because they are stored as sequences of 16-bit UTF-16 code units (although characters outside the Basic Multilingual Plane still occupy two such units).

  3. Asian languages treated all alike to the detriment of national specificities: at the encoding level, yes, human languages presented as text are all treated alike; whether it is the letter 'i' of a single stroke or the character '長' of 16 strokes, each is just a character.
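An editorial sketch of the scanning idea from point 2 (not from the original answer): the high bits of a UTF-8 lead byte encode the sequence length, so a linear scan can step over whole characters without decoding them.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Scan {
    // Returns the length in bytes of the UTF-8 sequence starting at this lead byte.
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80)         return 1; // 0xxxxxxx: ASCII
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        throw new IllegalArgumentException("not a UTF-8 lead byte");
    }

    public static void main(String[] args) {
        byte[] utf8 = "i長é".getBytes(StandardCharsets.UTF_8);
        // One pass over the bytes: O(n) in the byte length of the text.
        for (int i = 0; i < utf8.length; ) {
            int len = sequenceLength(utf8[i]);
            System.out.println("character of " + len + " byte(s) at offset " + i);
            i += len; // skip to the next character's lead byte
        }
    }
}
```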

Upvotes: 0

deceze

Reputation: 522005

The first two points are very much negligible. You'd need a very specific use case where the differences in size and performance are discernible enough to justify the headaches of mixed encodings.

Regarding the Unihan characters: they are grouped by the meaning of the character, but a given character may be written slightly differently in different writing systems. This is a matter of properly marking up the language; it's not really an encoding problem. In HTML documents you can mark the content with lang attributes and/or set language-specific fonts using CSS, which will render the character appropriately for the language. How to handle this correctly depends on the type of software (HTML, desktop app, etc.), so I'd advise you to open a new, more detailed question about that.

Upvotes: 0
