Keir

Reputation: 365

Golang regexp with non-latin characters

I am parsing words from some sentences and my \w+ regexp works fine with Latin characters. However, it totally fails with some Cyrillic characters.

Here is a sample app:

package main

import (
    "fmt"
    "regexp"
)

func get_words_from(text string) []string {
    words := regexp.MustCompile("\\w+")
    return words.FindAllString(text, -1)
}

func main() {
    text := "One, two three!"
    text2 := "Раз, два три!"
    text3 := "Jedna, dva tři čtyři pět!"
    fmt.Println(get_words_from(text))
    fmt.Println(get_words_from(text2))
    fmt.Println(get_words_from(text3))
}

It yields the following results:

 [One two three]
 []
 [Jedna dva t i ty i p t]

It returns an empty slice for the Russian text and splits the Czech words at every accented letter. I have no idea how to solve this issue. Could someone give me a piece of advice?

Or maybe there is a better way to split a sentence into words without punctuation?

Upvotes: 25

Views: 13360

Answers (1)

Wiktor Stribiżew

Reputation: 626893

In Go's regexp package, the \w shorthand class matches only ASCII word characters, so to match letters from any script you need the Unicode letter class \p{L}. The regexp/syntax documentation defines \w as:

\w word characters (== [0-9A-Za-z_])
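A quick way to see the difference is to test both classes against single letters. This is only a minimal sketch; the `matches` helper is a hypothetical name introduced for illustration, not part of the regexp package.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches reports whether pattern matches anywhere in s.
// Hypothetical helper for this illustration only.
func matches(pattern, s string) bool {
	return regexp.MustCompile(pattern).MatchString(s)
}

func main() {
	// \w covers only [0-9A-Za-z_].
	fmt.Println(matches(`\w`, "a")) // ASCII letter
	fmt.Println(matches(`\w`, "ж")) // Cyrillic letter
	fmt.Println(matches(`\w`, "ř")) // Latin letter outside ASCII

	// \p{L} covers letters from any script.
	fmt.Println(matches(`\p{L}`, "ж"))
	fmt.Println(matches(`\p{L}`, "ř"))
}
```

Only the first \w check should report true, which is why the Czech words in the question were cut at each accented letter.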

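To keep digits and the underscore as part of a word, as \w does, combine them with \p{L} in a character class (note that \d is also ASCII-only in Go, so it matches just 0-9):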

    regexp.MustCompile("[\\p{L}\\d_]+")

With this pattern, the sample app from the question prints:

[One two three]
[Раз два три]
[Jedna dva tři čtyři pět]

Upvotes: 33
