rewolf

Reputation: 5871

Matching multiple unicode characters in Golang Regexp

As a simplified example, I want ^⬛+$ to match ⬛⬛⬛ in full, so that the match returned is ⬛⬛⬛.

    r := regexp.MustCompile("^⬛+$")
    matches := r.FindString("⬛️⬛️⬛️")
    fmt.Println(matches)

But it doesn't match, even though the same approach works with regular ASCII characters.

I'm guessing there's something I don't know about Unicode matching, but I haven't found a decent explanation in the documentation yet.

Can someone explain the problem?

Go Play

Upvotes: 1

Views: 2860

Answers (2)

Wiktor Stribiżew

Reputation: 626896

You need to account for all chars in the string. If you analyze the string you will see it contains:

Three pairs of U+2B1B (BLACK LARGE SQUARE) followed by U+FE0F (VARIATION SELECTOR-16).

So you need a regex that matches one or more \x{2B1B}\x{FE0F} pairs, anchored from the start to the end of the string:

^(?:\x{2B1B}\x{FE0F})+$

See the regex demo.

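Note that you can use \p{M} to match any Unicode mark (general category M), which covers combining diacritics as well as variation selectors such as U+FE0F: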

^(?:\x{2B1B}\p{M})+$
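A short sketch of how the \p{M} version behaves in Go. The variable names are made up for illustration, and the second pattern, which makes the selector optional with \x{FE0F}?, is an added assumption for the case where the input may or may not carry selectors; it is not part of the answer above.

```go
package main

import (
	"fmt"
	"regexp"
)

// withMark requires every square to be followed by one Unicode mark.
var withMark = regexp.MustCompile(`^(?:\x{2B1B}\p{M})+$`)

// optionalSelector is an illustrative variant (an assumption, not from the
// answer): it accepts squares with or without a trailing U+FE0F.
var optionalSelector = regexp.MustCompile(`^(?:\x{2B1B}\x{FE0F}?)+$`)

func main() {
	emoji := "\u2B1B\uFE0F\u2B1B\uFE0F\u2B1B\uFE0F" // squares with selectors
	plain := "\u2B1B\u2B1B\u2B1B"                   // squares only

	fmt.Println(withMark.MatchString(emoji))         // true
	fmt.Println(withMark.MatchString(plain))         // false: no mark after each square
	fmt.Println(optionalSelector.MatchString(emoji)) // true
	fmt.Println(optionalSelector.MatchString(plain)) // true
}
```

Keep in mind that \p{M} is broader than \x{FE0F}: it also accepts any other combining mark after the square, which may or may not be what you want.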

See the Go demo:

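    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        // Each square (U+2B1B) must be followed by variation selector-16 (U+FE0F).
        r := regexp.MustCompile(`^(?:\x{2B1B}\x{FE0F})+$`)
        // The subject from the question, written with escapes so that the
        // invisible U+FE0F after each square is explicit.
        matches := r.FindString("\u2B1B\uFE0F\u2B1B\uFE0F\u2B1B\uFE0F")
        fmt.Println(matches)
    }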

Upvotes: 4

Ansible Adams

Reputation: 24

The regular expression matches a string that consists only of one or more ⬛ (U+2B1B, black large square).

The subject string is three pairs of a black large square followed by variation selector-16 (U+FE0F). The variation selectors are invisible (on my terminal), but they are still part of the string and prevent a match.

Fix this by either removing the variation selectors from the subject string or adding the variation selector to the pattern.

Here's the first fix: https://go.dev/play/p/oKIVnkC7TZ1
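For readers who cannot open the link, the first fix can be sketched roughly as follows. This is not necessarily the code behind the playground link; the helper name stripVS16 is made up for illustration, and it assumes that U+FE0F is the only extra character in the input.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// squares is the pattern from the question, with the square written as an escape.
var squares = regexp.MustCompile(`^\x{2B1B}+$`)

// stripVS16 removes every variation selector-16 (U+FE0F) from s.
// The name is illustrative; it is not part of any library.
func stripVS16(s string) string {
	return strings.ReplaceAll(s, "\uFE0F", "")
}

func main() {
	subject := "\u2B1B\uFE0F\u2B1B\uFE0F\u2B1B\uFE0F"

	fmt.Println(squares.MatchString(subject)) // false: the selectors get in the way

	cleaned := stripVS16(subject)
	fmt.Println(squares.MatchString(cleaned))    // true
	fmt.Println(cleaned == "\u2B1B\u2B1B\u2B1B") // true
}
```

Stripping the selector changes the data, so the squares may render in text style rather than emoji style afterwards; adding the selector to the pattern leaves the input untouched.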

Upvotes: 0
