Reputation: 5370
What is a practical application of having a combining character representation of a symbol in Unicode when a single code point mapping to the symbol will alone suffice?
What programming/non-programming advantage does it give us?
Upvotes: 1
Views: 558
Reputation: 134
The decomposed components are better for text editing, and "possibly but not definite" with good compression ratio.
When editing text, there are times when modifying an accent mark is wanted, but precomposed (precomposed is not a word by Firefox spell check) characters do not allow partial modifications. Sometimes, users may want to modify the base character without removing the accent. Those kinds of editing prefers using decomposed characters.
About compression ratio, it makes more sense during the days of separate encoding per language. In such times, the 8-bit encoding per language allows each language to have their own character sets. Some languages have better compression ratio for decomposed character. The small space of the 8-bits means that they could only fit so many unique code points and use variable width with decomposed characters.
Upvotes: 0
Reputation: 201518
There is no particular programming advantage in using a decomposed presentation (base character and combining character) when a precomposed presentation exists, e.g. using U+0065 U+0065 LATIN SMALL LETTER E U+0301 COMBINING ACUTE ACCENT instead of U+00E9 LATIN SMALL LETTER E WITH ACUTE “é”. Such decomposed presentations are something that needs to be dealt with in programming, part of the problem, rather than an advantage. So it’s similar to asking about the benefits of having the letter U in the character code.
The reasons why decomposed presentations (or the letter U) are used in actual data and need to be handled are external to programming and hence off-topic at SO.
Decomposing all decomposable characters may have advantages in processing, as it makes the data more uniform, canonical. This would relate to some particular features of the processing needed, and it would be implemented by performing (with a library routine, usually) normalization to NFD or NFKD form. But this would normally be part of the processing, not something imposed on input format. If some string matching is performed, it is mostly desirable to treat decomposed and precomposed representations of a character as equivalent, and normalization makes this easy. But this a way of dealing with the two different representations, not a cause for their existence, and it can equally well be performed by normalizing to NFC (i.e., precompose everything that can be precomposed). See the Unicode FAQ section Normalization.
Upvotes: 1