Michael

Reputation: 7123

How can I get the Unicode value of a character in Go?

I am trying to get the Unicode value of a character in a Go string as an int value.

I do this:

value = strconv.Itoa(int(([]byte(char))[0]))

where char contains a string with one character.

That works for many cases. It doesn't work for umlauts like ä, ö, ü, Ä, Ö, Ü.

E.g. Ä results in 65, which is the same as for A.

How can I do that?

Supplement: I had two problems. The first was solved by any of the answers below. The second was a bit trickier: my input was not normalized, e.g. umlauts were represented by two code points (base letter plus combining mark) instead of one. As ANisus said, the solution is found in the package golang.org/x/text/unicode/norm. The line above is now two lines:

r, _ := utf8.DecodeRune(norm.NFC.Bytes([]byte(char)))
value = strconv.Itoa(int(r))

Any hints to make this shorter are welcome.

Upvotes: 4

Views: 16029

Answers (3)

icza

Reputation: 418505

The "character" type in Go is the rune, which is an alias for int32 (see also Rune literals). A rune is an integer value identifying a Unicode code point.

In Go, strings are represented and stored as the UTF-8 encoded byte sequence of the text. The range form of the for loop iterates over the runes of the text:

s := "äöüÄÖÜ世界"
for _, r := range s {
    fmt.Printf("%c - %d\n", r, r)
}

Output:

ä - 228
ö - 246
ü - 252
Ä - 196
Ö - 214
Ü - 220
世 - 19990
界 - 30028

Try it on the Go Playground.
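
For the single-character case from the question, the same idea can be sketched as below. The helper names firstByteValue and firstRuneValue are hypothetical, chosen only for illustration; the first mirrors the byte-based line from the question, the second takes the first rune from a range loop. The literals use \u escapes so the example does not depend on how the source file encodes ä.

```go
package main

import (
	"fmt"
	"strconv"
)

// firstByteValue (hypothetical name) mirrors the approach from the question:
// it formats only the first byte of s. It panics on an empty string.
func firstByteValue(s string) string {
	return strconv.Itoa(int(s[0]))
}

// firstRuneValue (hypothetical name) formats the first code point of s,
// using the range form, which decodes UTF-8 into runes.
func firstRuneValue(s string) string {
	for _, r := range s {
		return strconv.Itoa(int(r))
	}
	return "" // empty input: there is no code point to report
}

func main() {
	s := "\u00e4" // precomposed ä, stored as the two bytes 0xC3 0xA4
	fmt.Println(firstByteValue(s)) // 195, the first UTF-8 byte
	fmt.Println(firstRuneValue(s)) // 228, the code point U+00E4
}
```

For ASCII input both helpers agree, which is probably why the byte-based version appears to work in many cases.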

Read this blog article if you want to know more about the topic:

Strings, bytes, runes and characters in Go

Upvotes: 8

ANisus

Reputation: 78065

Strings are UTF-8 encoded, so to decode a character from a string and get its rune (Unicode code point), you can use the unicode/utf8 package.

Example:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    str := "AÅÄÖ"

    for len(str) > 0 {
        r, size := utf8.DecodeRuneInString(str)
        fmt.Printf("%d %v\n", r, size)

        str = str[size:]
    }
}

Result:

65 1
197 2
196 2
214 2
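
If the input may be empty or contain invalid UTF-8, the returned size can be used to detect that. The following is a minimal sketch, not part of the original answer; codePoint is a hypothetical helper name. It relies on the documented behavior of utf8.DecodeRuneInString, which returns (utf8.RuneError, 0) for an empty string and (utf8.RuneError, 1) for an invalid encoding.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// codePoint (hypothetical name) returns the first code point of s and
// reports whether s started with a validly encoded rune.
func codePoint(s string) (rune, bool) {
	r, size := utf8.DecodeRuneInString(s)
	// size 0 means empty input, size 1 with RuneError means invalid UTF-8.
	// A genuine U+FFFD in the input decodes with size 3 and is accepted.
	if r == utf8.RuneError && size <= 1 {
		return 0, false
	}
	return r, true
}

func main() {
	for _, s := range []string{"\u00c4", "", "\xff"} {
		r, ok := codePoint(s)
		fmt.Printf("%q -> %d %v\n", s, r, ok)
	}
}
```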

Edit: (To clarify Michael's supplement)

A character such as Ä may be created using different Unicode code points:

Precomposed: Ä (U+00C4)
Using combining diaeresis: A (U+0041) + ¨ (U+0308)

In order to get the precomposed form, one can use the normalization package, golang.org/x/text/unicode/norm. The NFC (Canonical Decomposition, followed by Canonical Composition) form will turn U+0041 + U+0308 into U+00C4:

c := "\u0041\u0308"
r, _ := utf8.DecodeRune(norm.NFC.Bytes([]byte(c)))
fmt.Printf("%+q", r) // '\u00c4'
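
To check whether a given input is in the decomposed form at all, counting its runes with the standard library is enough for a quick diagnosis. This is only a sketch using unicode/utf8; isSingleCodePoint is a hypothetical helper name, and it does not perform any normalization itself.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isSingleCodePoint (hypothetical name) reports whether s contains exactly
// one rune as counted by utf8.RuneCountInString.
func isSingleCodePoint(s string) bool {
	return utf8.RuneCountInString(s) == 1
}

func main() {
	precomposed := "\u00c4"      // Ä as one code point
	decomposed := "\u0041\u0308" // A followed by a combining diaeresis

	fmt.Println(isSingleCodePoint(precomposed)) // true
	fmt.Println(isSingleCodePoint(decomposed))  // false
	fmt.Println(precomposed == decomposed)      // false: the byte sequences differ
}
```

Both strings render as the same glyph, but only the precomposed one yields 196 when its first rune is decoded; the decomposed one yields 65 until it is normalized to NFC.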

Upvotes: 12

Xi Sigma

Reputation: 2372

You can use the unicode/utf8 package:

r, _ := utf8.DecodeRuneInString("Ä")
fmt.Println(r)

Upvotes: 6
