Reputation: 45
I need to identify the correct character indexes in Twitter messages (various languages, emojis, etc.).
I can't find a solution that returns the positions shown in the example below.
package main

import (
	"regexp"
	"testing"

	"github.com/stretchr/testify/require"
)

func TestA(t *testing.T) {
	text := "🇷🇺 [URGENT] Les forces de dissuasion #nucleaire de la #Russie"
	re := regexp.MustCompile(`#([_A-Za-z0-9]+)`)
	pos := re.FindAllStringIndex(text, -1)
	// FindAllStringIndex returns
	// [0][43,53]
	// [1][60,67]
	// These are the expected positions (require.Equal takes expected, then actual).
	require.Equal(t, 37, pos[0][0])
	require.Equal(t, 47, pos[0][1])
	require.Equal(t, 54, pos[1][0])
	require.Equal(t, 61, pos[1][1])
}
Upvotes: 2
Views: 162
Reputation: 627607
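The FindAllStringIndex() function returns byte offsets, not rune (code point) positions. The two flag characters at the start of the text take 4 bytes each in UTF-8 but count as only 2 runes, which is why every offset is 6 too high.
You need to import "unicode/utf8" and use utf8.RuneCountInString(text[:pos[0][0]]) (and so on) instead of pos[0][0], so that you count Unicode code points and not bytes: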
package main

import (
	"regexp"
	"testing"
	"unicode/utf8"

	"github.com/stretchr/testify/require"
)

func TestA(t *testing.T) {
	text := "🇷🇺 [URGENT] Les forces de dissuasion #nucleaire de la #Russie"
	re := regexp.MustCompile(`#\w+`)
	pos := re.FindAllStringIndex(text, -1) // byte offsets: [[43 53] [60 67]]
	// Count the runes in the prefix before each byte offset to get the rune index.
	// require.Equal takes the expected value first, then the actual one.
	require.Equal(t, 37, utf8.RuneCountInString(text[:pos[0][0]]))
	require.Equal(t, 47, utf8.RuneCountInString(text[:pos[0][1]]))
	require.Equal(t, 54, utf8.RuneCountInString(text[:pos[1][0]]))
	require.Equal(t, 61, utf8.RuneCountInString(text[:pos[1][1]]))
}
See the Go demo.
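Also, #\w+ is a shorter pattern that matches a # followed by one or more letters, digits or underscores. In Go's regexp package \w is ASCII-only (the same as [0-9A-Za-z_]), so it matches exactly what the original character class did.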
Upvotes: 2