Stephan Neuhaus
Stephan Neuhaus

Reputation: 128

Go: How to find out a rune's Unicode properties?

I want to find out a rune's Unicode properties, particularly the value of its script property. Unicode has this to say (in http://www.unicode.org/reports/tr24/ Section 1.5):

The script property assigns a single value to each character, either
explicitly associating it with a particular script, or assigning one
of several specail [sic] values.

Go's unicode package provides me with a way to ask, "Is this rune in script x?", but has no way for me to ask, "In what script is this rune?". I could obviously iterate over all scripts, but that would be wasteful. Is there a cleverer way to find out a rune's script? (I could always implement a self-organising list, but I'm looking for something in the standard go libraries that already does what I want, and that I have overlooked.)

Thanks all!

Upvotes: 4

Views: 568

Answers (2)

icza
icza

Reputation: 418505

Improving PeterSO's answer

PeterSO's answer is nice and clear. It doesn't go easy on memory usage though, as it stores more than a hundred thousand entries in a map, values being of string type. Even though a string value is just a header storing a pointer and a length (see reflect.StringHeader), having so many of them in a map is still multiple MB (like 6 MB)!

Since the number of possible different string values (the different script names) is small (137), we may opt to use a value type byte, which will just be an index in a slice storing the real script names.

This is how it could look like:

var runeScript map[rune]byte

var names = []string{""}

func init() {
    const nChar = 128172 // Version 9.0.0
    runeScript = make(map[rune]byte, nChar*125/100)
    for s, rt := range unicode.Scripts {
        idx := byte(len(names))
        names = append(names, s)
        for _, r := range rt.R16 {
            for i := r.Lo; i <= r.Hi; i += r.Stride {
                runeScript[rune(i)] = idx
            }
        }
        for _, r := range rt.R32 {
            for i := r.Lo; i <= r.Hi; i += r.Stride {
                runeScript[rune(i)] = idx
            }
        }
    }
}

func script(r rune) string {
    return names[runeScript[r]]
}

func main() {
    chars := []rune{' ', '0', 'a', 'α', 'А', 'ㄱ'}
    for _, c := range chars {
        s := script(c)
        fmt.Printf("%q %s\n", c, s)
    }
}

This simple improvement requires only one third of the memory compared to using map[rune]string. Output is the same (try it on the Go Playground):

' ' Common
'0' Common
'a' Latin
'α' Greek
'А' Cyrillic
'ㄱ' Hangul

Building merged range slices

Using map[rune]byte will result in like 2 MB of RAM usage, and it takes "some" time to build this map, which may or may not be acceptable.

There's another approach / solution. We may opt in to not build a map of "all" runes, but only store a slice of all ranges (actually 2 slices of ranges, one with 16-bit Unicode values, and another with 32-bit Unicode codepoints).

The benefit of this originates from the fact that the number of ranges is much less than the number of runes: only 852 (compared to 100,000+ runes). Memory usage of 2 slices having a total of 852 elements will be negligible compared to solution #1.

In our ranges we also store the script (name), so we can return this info. We could also store only a name index (as in solution #1), but since we only have 852 ranges, it's not worth it.

We'll sort the range slices, so we can use binary search in it (~400 elements in a slice, binary search: we get the result in like 7 steps max, and worst case repeating binary search on both: 15 steps).

Ok, so let's see. We're using these range wrappers:

type myR16 struct {
    r16    unicode.Range16
    script string
}

type myR32 struct {
    r32    unicode.Range32
    script string
}

And store them in:

var allR16 = []*myR16{}
var allR32 = []*myR32{}

We initialize / fill them like this:

func init() {
    for script, rt := range unicode.Scripts {
        for _, r16 := range rt.R16 {
            allR16 = append(allR16, &myR16{r16, script})
        }
        for _, r32 := range rt.R32 {
            allR32 = append(allR32, &myR32{r32, script})
        }
    }

    // sort
    sort.Slice(allR16, func(i int, j int) bool {
        return allR16[i].r16.Lo < allR16[j].r16.Lo
    })
    sort.Slice(allR32, func(i int, j int) bool {
        return allR32[i].r32.Lo < allR32[j].r32.Lo
    })
}

And finally the search in the sorted range slices:

func script(r rune) string {
    // binary search over ranges
    if r <= 0xffff {
        r16 := uint16(r)
        i := sort.Search(len(allR16), func(i int) bool {
            return allR16[i].r16.Hi >= r16
        })

        if i < len(allR16) && allR16[i].r16.Lo <= r16 && r16 <= allR16[i].r16.Hi {
            return allR16[i].script
        }
    }

    r32 := uint32(r)
    i := sort.Search(len(allR32), func(i int) bool {
        return allR32[i].r32.Hi >= r32
    })

    if i < len(allR32) && allR32[i].r32.Lo <= r32 && r32 <= allR32[i].r32.Hi {
        return allR32[i].script
    }

    return ""
}

Note: the Stride is always 1 in all scripts in the unicode package, which I took advantage of (and did not include it in the algorithm).

Testing with the same code, we get the same output. Try it on the Go Playground.

Upvotes: 2

peterSO
peterSO

Reputation: 166855

The easiest and quickest solution is to write the function. For example,

package main

import (
    "fmt"
    "unicode"
)

var runeScript map[rune]string

func init() {
    const nChar = 128172 // Version 9.0.0
    runeScript = make(map[rune]string, nChar*125/100)
    for s, rt := range unicode.Scripts {
        for _, r := range rt.R16 {
            for i := r.Lo; i <= r.Hi; i += r.Stride {
                runeScript[rune(i)] = s
            }
        }
        for _, r := range rt.R32 {
            for i := r.Lo; i <= r.Hi; i += r.Stride {
                runeScript[rune(i)] = s
            }
        }
    }
}

func script(r rune) string {
    return runeScript[r]
}

func main() {
    chars := []rune{' ', '0', 'a', 'α', 'А', 'ㄱ'}
    for _, c := range chars {
        s := script(c)
        fmt.Printf("%q %s\n", c, s)
    }
}

Output:

$ go run script.go
' ' Common
'0' Common
'a' Latin
'α' Greek
'А' Cyrillic
'ㄱ' Hangul
$ 

Upvotes: 5

Related Questions