Regex for the pattern of one optional space before Chinese words in Lua

Question

I tried using string.match("Í",'%s?[\u{4e00}-\u{9FFF}]+'), which is similar to how this would be written in JS or other languages. But it matches characters it should not, such as the 'Í' above.

The official way of matching UTF-8 uses the escape \ddd, but \u{XXX} seems to fail because

Lua's pattern matching facilities work byte by byte

For now, I use a fragile workaround similar to utf8.charpattern: string.match("Í",'%s?[\228-\233][%z\1-\191][%z\1-\191]'). It is based on the UTF-8 encoding, returns nil for 'Í', and works for checking CJK characters like '我', although the range for the second byte from the left is wrong.

Q:

How can I solve this problem with regex?

Luatic · Accepted Answer

Lua patterns are not regular expressions. Regular expressions have features that Lua patterns don't have (e.g. grouping, possibly nested, and choice), and Lua patterns have features that regular expressions (at least in the formal linguistic sense) do not have (e.g. %b, %1).
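
To make the difference concrete, here is a small sketch of the features mentioned above. The sample strings and variable names are made up for illustration; only string.match and the documented pattern items %b and %1 are used.

```lua
-- %b() matches a balanced pair of parentheses, including nested ones.
local balanced = ("f(a(b)c)d"):match("%b()")
print(balanced) -- (a(b)c)

-- %1 refers back to the first capture: here, a letter followed by itself.
local doubled = ("hello"):match("(%a)%1")
print(doubled) -- l

-- There is no choice operator: "|" is an ordinary character in a Lua pattern,
-- so this looks for the literal text "cat|dog".
local no_choice = ("dog"):match("cat|dog")
print(no_choice) -- nil
```
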
You are right: Lua patterns do not operate on "code points", they operate on bytes. That's why \u{4e00}-\u{9FFF} doesn't work: what Lua sees inside the brackets is \228\184\128-\233\191\191, equivalent to \184\191\228\128-\233, which is very different from what you want (notably, the range suddenly runs from \128 to \233, which is why both bytes of 'Í', \195\141, match). I consider the interaction of - with multibyte "characters" that appear as a single code point in the source a bit of a footgun.
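
The byte-level view can be checked directly. The following is a minimal sketch, assuming Lua 5.3 or later (for the \u{XXX} escape); the helper name bytes is hypothetical and only exists to print a string's bytes in decimal.

```lua
-- Hypothetical helper: list the bytes of a string in decimal.
local function bytes(str)
    return table.concat({str:byte(1, -1)}, " ")
end

print(bytes("\u{4e00}")) -- 228 184 128
print(bytes("\u{9FFF}")) -- 233 191 191
print(bytes("\u{CD}"))   -- 195 141 (the character 'Í')

-- Both bytes of 'Í' fall inside the accidental byte range \128-\233,
-- so the bracket class matches them one byte at a time.
local matched = ("\u{CD}"):match("%s?[\u{4e00}-\u{9FFF}]+")
print(matched == "\u{CD}") -- true
```
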

Since you want a pure Lua solution, and given the simplicity of your pattern, a handmade solution is feasible:

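-- s is the input string; utf8.codes raises an error if s is not valid UTF-8
local codepoints = {}
local got_chinese_character = false
for _, c in utf8.codes(s) do
    if c >= 0x4e00 and c <= 0x9FFF then
        table.insert(codepoints, c)
        got_chinese_character = true
    elseif got_chinese_character then
        break -- the first run of Chinese characters has ended
    elseif utf8.char(c):match"^%s$" then
        codepoints = {c} -- candidate for the optional leading space
    else
        codepoints = {} -- anything else discards a pending space
    end
end
-- nil if no Chinese character was found, like string.match
local match = got_chinese_character and utf8.char(table.unpack(codepoints)) or nil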

Edit: Since you want to check for a full match, this can be simplified:

-- s is the input string; utf8.codes raises an error if s is not valid UTF-8
local match = true
local got_chinese_character = false
for p, c in utf8.codes(s) do
    if c >= 0x4e00 and c <= 0x9FFF then
        got_chinese_character = true
    elseif p > 1 or not utf8.char(c):match"^%s$" then
        -- non-chinese character that is not a leading space
        match = false
        break
    end
end
match = match and got_chinese_character
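
For reuse, the full-match check can be wrapped in a function. This is a sketch, assuming Lua 5.3 or later; the function name is_optional_space_then_chinese is made up for illustration, and the up-front utf8.len check is an addition so that invalid UTF-8 yields false instead of an error.

```lua
-- Hypothetical helper: true if s is one optional leading space
-- followed by one or more code points in U+4E00..U+9FFF.
local function is_optional_space_then_chinese(s)
    -- utf8.len returns a false value for invalid UTF-8; reject such input
    -- here because utf8.codes would raise an error on it.
    if not utf8.len(s) then return false end
    local got_chinese_character = false
    for p, c in utf8.codes(s) do
        if c >= 0x4e00 and c <= 0x9FFF then
            got_chinese_character = true
        elseif p > 1 or not utf8.char(c):match("^%s$") then
            -- non-Chinese character that is not a leading space
            return false
        end
    end
    return got_chinese_character
end

print(is_optional_space_then_chinese("我们"))  -- true
print(is_optional_space_then_chinese(" 我们")) -- true
print(is_optional_space_then_chinese("Í"))     -- false
print(is_optional_space_then_chinese(" "))     -- false
print(is_optional_space_then_chinese("我 们")) -- false
```

Returning early keeps the loop free of the separate match flag; the result is the same as match and got_chinese_character in the loop above.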
