Reputation: 128
I want to find out a rune's Unicode properties, particularly the value of its script property. Unicode has this to say (in http://www.unicode.org/reports/tr24/ Section 1.5):
The script property assigns a single value to each character, either
explicitly associating it with a particular script, or assigning one
of several specail [sic] values.
Go's unicode package lets me ask, "Is this rune in script x?", but gives me no way to ask, "In what script is this rune?". I could obviously iterate over all scripts, but that would be wasteful. Is there a cleverer way to find out a rune's script? (I could always implement a self-organising list, but I'm looking for something in the standard Go libraries that already does what I want and that I have overlooked.)
Thanks all!
Upvotes: 4
Views: 568
Reputation: 418505
PeterSO's answer is nice and clear. It doesn't go easy on memory, though: it stores more than a hundred thousand entries in a map whose values are of type string. Even though a string value is just a header holding a pointer and a length (see reflect.StringHeader), that many of them in a map still adds up to several MB (about 6 MB)!
Since the number of distinct string values (the script names) is small (137), we can use byte as the value type instead; each value is then just an index into a slice that stores the real script names.
This is how it could look:
package main

import (
	"fmt"
	"unicode"
)

// runeScript maps each rune to an index into names.
var runeScript map[rune]byte

// names[0] is "" so that runes missing from the map report no script.
var names = []string{""}

func init() {
	const nChar = 128172 // Version 9.0.0
	runeScript = make(map[rune]byte, nChar*125/100)
	for s, rt := range unicode.Scripts {
		idx := byte(len(names))
		names = append(names, s)
		for _, r := range rt.R16 {
			for i := r.Lo; i <= r.Hi; i += r.Stride {
				runeScript[rune(i)] = idx
			}
		}
		for _, r := range rt.R32 {
			for i := r.Lo; i <= r.Hi; i += r.Stride {
				runeScript[rune(i)] = idx
			}
		}
	}
}

// script returns the script name of r, or "" if r is not in any script table.
func script(r rune) string {
	return names[runeScript[r]]
}

func main() {
	chars := []rune{' ', '0', 'a', 'α', 'А', 'ㄱ'}
	for _, c := range chars {
		s := script(c)
		fmt.Printf("%q %s\n", c, s)
	}
}
This simple change needs only about one third of the memory of the map[rune]string version. The output is the same (try it on the Go Playground):
' ' Common
'0' Common
'a' Latin
'α' Greek
'А' Cyrillic
'ㄱ' Hangul
Using map[rune]byte still takes about 2 MB of RAM, and building the map takes some time at startup, which may or may not be acceptable.
There's another approach. Instead of building a map of all runes, we can store just a slice of all ranges (actually 2 slices of ranges: one with 16-bit Unicode values, and another with 32-bit Unicode code points).
The benefit comes from the fact that there are far fewer ranges than runes: only 852 (compared to 100,000+ runes). The memory used by 2 slices with a total of 852 elements is negligible compared to solution #1.
In our ranges we also store the script (name), so we can return this info. We could store only a name index (as in solution #1), but with only 852 ranges it's not worth it.
We'll sort the range slices so that we can binary search them. With ~400 elements in a slice, a binary search finishes in about 9 steps (log2 of 400 is about 8.6); in the worst case, searching both slices, that is about 18 steps.
OK, so let's see. We're using these range wrappers:
type myR16 struct {
	r16    unicode.Range16
	script string
}

type myR32 struct {
	r32    unicode.Range32
	script string
}
And store them in:
var allR16 = []*myR16{}
var allR32 = []*myR32{}
We initialize / fill them like this:
func init() {
	for script, rt := range unicode.Scripts {
		for _, r16 := range rt.R16 {
			allR16 = append(allR16, &myR16{r16, script})
		}
		for _, r32 := range rt.R32 {
			allR32 = append(allR32, &myR32{r32, script})
		}
	}

	// sort
	sort.Slice(allR16, func(i int, j int) bool {
		return allR16[i].r16.Lo < allR16[j].r16.Lo
	})
	sort.Slice(allR32, func(i int, j int) bool {
		return allR32[i].r32.Lo < allR32[j].r32.Lo
	})
}
And finally the search in the sorted range slices:
func script(r rune) string {
	// binary search over ranges
	if r <= 0xffff {
		r16 := uint16(r)
		i := sort.Search(len(allR16), func(i int) bool {
			return allR16[i].r16.Hi >= r16
		})
		if i < len(allR16) && allR16[i].r16.Lo <= r16 && r16 <= allR16[i].r16.Hi {
			return allR16[i].script
		}
	}

	r32 := uint32(r)
	i := sort.Search(len(allR32), func(i int) bool {
		return allR32[i].r32.Hi >= r32
	})
	if i < len(allR32) && allR32[i].r32.Lo <= r32 && r32 <= allR32[i].r32.Hi {
		return allR32[i].script
	}

	return ""
}
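Note: Stride was 1 for every range in the unicode package's script tables when this was written, which I took advantage of (the algorithm ignores it). If the tables in your Go version contain ranges with a larger stride, this shortcut no longer holds and the lookup has to account for it.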
Testing with the same code, we get the same output. Try it on the Go Playground.
Upvotes: 2
Reputation: 166855
The easiest and quickest solution is to write the function. For example,
package main

import (
	"fmt"
	"unicode"
)

var runeScript map[rune]string

func init() {
	const nChar = 128172 // Version 9.0.0
	runeScript = make(map[rune]string, nChar*125/100)
	for s, rt := range unicode.Scripts {
		for _, r := range rt.R16 {
			for i := r.Lo; i <= r.Hi; i += r.Stride {
				runeScript[rune(i)] = s
			}
		}
		for _, r := range rt.R32 {
			for i := r.Lo; i <= r.Hi; i += r.Stride {
				runeScript[rune(i)] = s
			}
		}
	}
}

func script(r rune) string {
	return runeScript[r]
}

func main() {
	chars := []rune{' ', '0', 'a', 'α', 'А', 'ㄱ'}
	for _, c := range chars {
		s := script(c)
		fmt.Printf("%q %s\n", c, s)
	}
}
Output:
$ go run script.go
' ' Common
'0' Common
'a' Latin
'α' Greek
'А' Cyrillic
'ㄱ' Hangul
$
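As a quick sanity check, the map can be compared against the unicode package's own membership test. The sketch below is only an illustration: buildRuneScript and mismatches are hypothetical helper names, and the build loop mirrors the init function above (with the 16-bit loop variable widened so it cannot wrap around).

```go
package main

import (
	"fmt"
	"unicode"
)

// buildRuneScript (hypothetical helper) builds the rune-to-script-name map.
func buildRuneScript() map[rune]string {
	m := make(map[rune]string)
	for s, rt := range unicode.Scripts {
		for _, r := range rt.R16 {
			// Widen to uint32 so the loop variable cannot wrap around.
			for i := uint32(r.Lo); i <= uint32(r.Hi); i += uint32(r.Stride) {
				m[rune(i)] = s
			}
		}
		for _, r := range rt.R32 {
			for i := r.Lo; i <= r.Hi; i += r.Stride {
				m[rune(i)] = s
			}
		}
	}
	return m
}

// mismatches (hypothetical helper) counts the map entries whose script
// is not confirmed by unicode.Is.
func mismatches(m map[rune]string) int {
	n := 0
	for r, s := range m {
		if !unicode.Is(unicode.Scripts[s], r) {
			n++
		}
	}
	return n
}

func main() {
	m := buildRuneScript()
	fmt.Println(len(m), "runes,", mismatches(m), "mismatches")
}
```

A mismatch count of zero means every entry in the map agrees with the range tables it was built from.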
Upvotes: 5