RedGrittyBrick

Reputation: 4002

Go Unicode combining characters (grapheme clusters) and MS Windows Console cmd.exe

In the following code, the ü is not the single Unicode character U+00FC but a single grapheme cluster composed of two Unicode code points: the plain ASCII u (U+0075) followed by the combining diaeresis (U+0308).

fmt.Println("Jürgen Džemal")
fmt.Println("Ju\u0308rgen \u01c5emel")
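To make the difference between the two encodings visible, here is a minimal sketch that lists the code points in a string and flags nonspacing combining marks. The helper name codePoints is made up for this illustration; it uses only the standard library.

```go
package main

import (
	"fmt"
	"unicode"
)

// codePoints (hypothetical helper) returns one "U+XXXX" entry per code
// point in s, tagging nonspacing combining marks (Unicode category Mn).
func codePoints(s string) []string {
	var out []string
	for _, r := range s {
		d := fmt.Sprintf("%U", r)
		if unicode.Is(unicode.Mn, r) {
			d += " (combining)"
		}
		out = append(out, d)
	}
	return out
}

func main() {
	// Decomposed: u followed by the combining diaeresis U+0308.
	fmt.Println(codePoints("Ju\u0308rgen"))
	// Precomposed: the single code point U+00FC.
	fmt.Println(codePoints("J\u00fcrgen"))
}
```

The first string reports seven code points and the second six, even though both are meant to render as the same six visible characters.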

If I run it in the Go Playground, it works as expected.

If I run it in an MS Windows 10 "Command Prompt" window, the combining character is not visually combined with the preceding character. However, when I copy and paste the text into here, it appears correctly:

C:\> ver

Microsoft Windows [Version 10.0.17134.228]

C:\> test
Jürgen Džemal
Jürgen Džemel

On screen, in the "Command Prompt" window it looked more like:

Ju¨rgen Džemel

Changing the code page (chcp) from 850 to 65001 made no difference. Changing fonts (Consolas, Courier etc) made no difference.

In the past I have experienced problems that were fundamentally because Microsoft require Windows programs to use a different API to output characters to STDOUT depending on whether STDOUT is attached to a console or to a file. I don't know if this is a different manifestation of the same issue.

Is there something I can do to make this Unicode grapheme-cluster appear correctly?

Upvotes: 2

Views: 1013

Answers (1)

RedGrittyBrick

Reputation: 4002

As eryksun and Peter commented,

  • The Windows console (conhost.exe) doesn't support combining codes. You'll have to first normalize to an equivalent string that uses precomposed characters.
  • you can use golang.org/x/text/unicode/norm to do the normalization (e.g. norm.NFC.String("Jürgen Džemal"))

I tried this

s := "Ju\u0308rgen \u01c5emel"
fmt.Println(s)              // dieresis not combined with u by conhost.exe
s = norm.NFC.String(s)
fmt.Println(s)              // shows correctly

And the output looked like this

Ju¨rgen Džemel \n Jürgen Džemel

or, for the visually impaired with fabulously sophisticated screen readers - a bit like this:

Ju¨rgen Džemel
Jürgen Džemel

Note that Unicode defines four normalization forms (NFC, NFD, NFKC and NFKD). NFC is the most widely used on the Internet in web pages and is also appropriate for this situation.

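There are other methods in this package that may be more efficient or more useful, such as norm.NFC.Bytes for byte slices, norm.NFC.IsNormalString to check whether a string needs normalizing at all, and norm.NFC.Writer to normalize everything written to an io.Writer.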

I have read that some graphemes can only be represented in Unicode using combining characters; in other words, there is no precomposed character for them. NFC cannot help with those, so a more thorough approach would be needed to do something appropriate. The complications of Unicode (or, perhaps more accurately, of human languages and their typography) sometimes seem almost without end.
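As one possible starting point for that more thorough approach, the sketch below detects combining marks that are still present in a string (for example after NFC) and, as a lossy last resort, strips them. The helper names are made up for this illustration, and the choice of q with a diaeresis as a sequence with no precomposed form is, to my knowledge, correct but worth verifying. Stripping marks changes the text, so it may not be acceptable for names.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// hasCombiningMarks (hypothetical helper) reports whether s contains any
// nonspacing mark (Unicode category Mn), which conhost.exe would likely
// draw as a separate character.
func hasCombiningMarks(s string) bool {
	for _, r := range s {
		if unicode.Is(unicode.Mn, r) {
			return true
		}
	}
	return false
}

// stripCombiningMarks (hypothetical helper) drops every nonspacing mark.
// This is lossy: the base letters remain but the diacritics are gone.
func stripCombiningMarks(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.Is(unicode.Mn, r) {
			return -1 // a negative value tells strings.Map to drop the rune
		}
		return r
	}, s)
}

func main() {
	// Assumption: q + U+0308 has no precomposed equivalent.
	s := "q\u0308"
	if hasCombiningMarks(s) {
		fmt.Println("still contains combining marks; fallback:", stripCombiningMarks(s))
	}
}
```

In practice this check would run after norm.NFC.String, so that only the sequences NFC could not compose fall through to the fallback.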


Upvotes: 3
