Reputation: 4002
In the following code, the ü
is not the single Unicode character U+00FC but is a single grapheme cluster composed of two Unicode characters, the plain ASCII u
U+0075 followed by the combining diaeresis U+0308.
fmt.Println("Jürgen Džemal")
fmt.Println("Ju\u0308rgen \u01c5emel")
If I run it in the go playground, it works as expected.
If I run it in a MS Windows 10 "Command Prompt" window, it doesn't visually combine the combining character with the prior character. However when I cut and paste the text into here it appears correctly:
C:\> ver
Microsoft Windows [Version 10.0.17134.228]
C:\> test
Jürgen Džemal
Jürgen Džemel
On screen, in the "Command Prompt" window it looked more like:
Ju¨rgen Džemel
Changing the code page (chcp) from 850 to 65001 made no difference. Changing fonts (Consolas, Courier etc) made no difference.
In the past I have experienced problems that were fundamentally because Microsoft require Windows programs to use a different API to output characters to STDOUT depending on whether STDOUT is attached to a console or to a file. I don't know if this is a different manifestation of the same issue.
Is there something I can do to make this Unicode grapheme-cluster appear correctly?
Upvotes: 2
Views: 1013
Reputation: 4002
As eryksun and Peter commented,
golang.org/x/text/unicode/norm
to do the normalization (e.g. norm.NFC.String("Jürgen Džemal")
)I tried this
s := "Ju\u0308rgen \u01c5emel"
fmt.Println(s) // dieresis not combined with u by conhost.exe
s = norm.NFC.String(s)
fmt.Println(s) // shows correctly
And the output looked like this
or, for the visually impaired with fabulously sophisticated screen readers - a bit like this:
Ju¨rgen Džemel
Jürgen Džemel
Note that Unicode has four different normalised forms but NFC is the most used on the Internet in web-pages and is also appropriate for this situation.
There are other methods in this package that may be more efficient or more useful
I read there are visual-characters in use which can only be represented in Unicode using combining characters. In other words for which there is no precomposed character. A more thorough approach would be needed to do something appropriate with those. Essentially the complications of Unicode (or perhaps more accurately of human languages and their typography) are almost without end. It sometimes seems that way to me.
References
For example, several characters used in writing Lithuanian have double diacritics, as they have only decomposed forms. An example is lowercase U with macron and tilde ("ū̃", U+016b U+0303, where the first code point is a lowercase U with macron and the second is a combining acute accent).
Upvotes: 3