Reputation: 7123
I am trying to get the Unicode code point of a character in a Go string as an int value.
I do this:
value = strconv.Itoa(int(([]byte(char))[0]))
where char contains a string with one character.
That works for many cases. It doesn't work for umlauts like ä, ö, ü, Ä, Ö, Ü.
E.g. Ä results in 65, which is the same as for A.
How can I do that?
Supplement: I had two problems. The first was solved by any of the answers below. The second was trickier: my input was not normalized UTF-8, i.e. umlauts were represented by two code points instead of one. As ANisus said, the solution is found in the package golang.org/x/text/unicode/norm. The line above is now two lines:
r, _ := utf8.DecodeRune(norm.NFC.Bytes([]byte(char))) // named r to avoid shadowing the built-in type rune
value = strconv.Itoa(int(r))
Any hints to make this shorter welcome ...
Upvotes: 4
Views: 16029
Reputation: 418505
The "character" type in Go is the rune, which is an alias for int32; see also Rune literals. A rune is an integer value identifying a Unicode code point.
In Go, strings are represented and stored as the UTF-8 encoded byte sequence of the text. The range form of the for loop iterates over the runes of the text:
s := "äöüÄÖÜ世界"
for _, r := range s {
	fmt.Printf("%c - %d\n", r, r)
}
Output:
ä - 228
ö - 246
ü - 252
Ä - 196
Ö - 214
Ü - 220
世 - 19990
界 - 30028
Read this blog article if you want to know more about the topic:
Strings, bytes, runes and characters in Go
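For the single-character case in the question, a minimal sketch may help contrast indexing the bytes with decoding the rune. The helper name firstCodePoint is made up for illustration, and returning -1 for an empty string is an assumption about what the caller wants:

```go
package main

import (
	"fmt"
	"strconv"
)

// firstCodePoint is a hypothetical helper: it returns the first Unicode
// code point of s as an int, or -1 if s is empty.
func firstCodePoint(s string) int {
	rs := []rune(s) // the conversion decodes the UTF-8 bytes into runes
	if len(rs) == 0 {
		return -1
	}
	return int(rs[0])
}

func main() {
	char := "\u00c4" // precomposed Ä

	// Indexing the byte slice yields only the first byte of the UTF-8 encoding.
	fmt.Println(int([]byte(char)[0])) // 195

	// Decoding the rune yields the code point.
	fmt.Println(firstCodePoint(char)) // 196

	// The same value as a decimal string, as in the question.
	fmt.Println(strconv.Itoa(firstCodePoint(char))) // 196
}
```

Converting to []rune decodes the whole string, so for long inputs the range loop or utf8.DecodeRuneInString is likely the cheaper choice.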
Upvotes: 8
Reputation: 78065
Strings are UTF-8 encoded, so to decode a character from a string and get its rune (Unicode code point), you can use the unicode/utf8 package.
Example:
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	str := "AÅÄÖ"
	for len(str) > 0 {
		// Decode the first rune and its width in bytes, then advance past it.
		r, size := utf8.DecodeRuneInString(str)
		fmt.Printf("%d %v\n", r, size)
		str = str[size:]
	}
}
Result:
65 1
197 2
196 2
214 2
Edit: (To clarify Michael's supplement)
A character such as Ä may be created using different Unicode code points:
Precomposed: Ä (U+00C4)
Using combining diaeresis: A (U+0041) + ¨ (U+0308)
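The two forms listed above can be told apart with the standard library alone. The following is only a sketch, and codePoints is a hypothetical helper name. It shows that the decomposed form starts with U+0041, which would explain the 65 reported in the question:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// codePoints is a hypothetical helper that lists the code points in s.
func codePoints(s string) []rune {
	return []rune(s)
}

func main() {
	precomposed := "\u00c4"      // Ä as a single code point
	decomposed := "\u0041\u0308" // A followed by a combining diaeresis

	fmt.Println(utf8.RuneCountInString(precomposed)) // 1
	fmt.Println(utf8.RuneCountInString(decomposed))  // 2

	fmt.Printf("%U\n", codePoints(precomposed)) // [U+00C4]
	fmt.Printf("%U\n", codePoints(decomposed))  // [U+0041 U+0308]

	// The two strings render alike but are not equal byte for byte.
	fmt.Println(precomposed == decomposed) // false
}
```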
In order to get the precomposed form, one can use the normalization package, golang.org/x/text/unicode/norm. The NFC form (Canonical Decomposition, followed by Canonical Composition) will turn U+0041 + U+0308 into U+00C4:
c := "\u0041\u0308"
r, _ := utf8.DecodeRune(norm.NFC.Bytes([]byte(c)))
fmt.Printf("%+q", r) // '\u00c4'
Upvotes: 12
Reputation: 2372
You can use the unicode/utf8 package:
r, _ := utf8.DecodeRuneInString("Ä")
fmt.Println(r)
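The second return value discarded above is the width of the decoded rune in bytes, and it can be used to detect empty or invalid input. A minimal sketch, where decodeFirst is a hypothetical wrapper and the (0, false) failure value is an assumption about the desired behaviour:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeFirst is a hypothetical wrapper: it reports the first code point
// of s and whether decoding succeeded.
func decodeFirst(s string) (int, bool) {
	r, size := utf8.DecodeRuneInString(s)
	// RuneError with size 0 means empty input; with size 1 it means invalid UTF-8.
	// A genuine U+FFFD in the input has size 3 and is treated as valid.
	if r == utf8.RuneError && size <= 1 {
		return 0, false
	}
	return int(r), true
}

func main() {
	fmt.Println(decodeFirst("\u00c4")) // 196 true
	fmt.Println(decodeFirst(""))       // 0 false
	fmt.Println(decodeFirst("\xff"))   // 0 false
}
```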
Upvotes: 6