Reputation: 85
a = string.sub("слово", 1,1)
слово is a Russian word in which every letter takes 2 bytes, and the output of
print(a)
is going to be something like � depending on where you run it
if a =="?" then
if a =="�" then
if a =="" then
if a ==nil then
None of these work.
I just want to know which symbol would work for the if statement
Upvotes: 1
Views: 70
Reputation: 1928
To select the n-th character of a UTF-8 string, use utf8.offset
(Lua 5.3+ is required):
local s = "слово"
local n = 4
local c = s:match(utf8.charpattern, utf8.offset(s,n))
print(c) --> в
if c == "в" then
  -- ...
end
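The same lookup can be wrapped in a small helper. This is only a sketch, assuming Lua 5.3 or later and a UTF-8 encoded string; the name nth_char is hypothetical and not part of the utf8 library.

```lua
-- Hypothetical helper (not part of the standard library): returns the n-th
-- UTF-8 character of s for n >= 1, or nil when s has fewer than n characters.
local function nth_char(s, n)
  local pos = utf8.offset(s, n)   -- byte position where the n-th character starts
  if not pos or pos > #s then     -- utf8.offset gives #s + 1 one past the last character
    return nil
  end
  return s:match(utf8.charpattern, pos)
end

local s = "слово"                 -- assumes this file is saved as UTF-8
print(nth_char(s, 1))
print(nth_char(s, 4))
print(nth_char(s, 6))             --> nil
```

The range check matters because utf8.offset returns a position just past the end of the string when n is one more than the character count, and s:match would then return nil anyway; checking explicitly just makes the intent visible.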
Upvotes: 1
Reputation: 5021
с
is made of 2 bytes whose values are 209
and 129
so a = string.sub("слово", 1,1)
will get only the first of those 2 bytes. That means you should be checking for the byte value 209,
and you can do this in Lua by writing a backslash escape followed by the decimal value in a string literal.
a = string.sub("слово", 1,1)
if a == '\209' then
  print('Do stuff')
end
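As a variation on the snippet above, the byte can also be compared numerically, and the whole first letter can be compared by taking both of its bytes. This is a sketch that assumes the source file is saved as UTF-8; the variable names are illustrative only.

```lua
local word = "слово"   -- assumes this source file is saved as UTF-8

-- Compare the first byte as a number instead of through a string escape.
local first_byte = string.byte(word, 1)
print(first_byte)                        --> 209

-- To compare against the whole first letter, take both of its bytes.
local first_letter = string.sub(word, 1, 2)
print(first_letter == "\209\129")        --> true
print(first_letter == "с")               -- same comparison, written with the letter itself
```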
Upvotes: 1
Reputation: 11201
Assuming your Lua source file is UTF-8 encoded, "слово" will be equivalent to "\209\129\208\187\208\190\208\178\208\190" (you can confirm this via ("слово"):gsub(".", function(c) return ("\\%d"):format(c:byte()) end)).
That is, as you observed, the Cyrillic characters, not being part of ASCII, are encoded as multi-byte sequences in UTF-8 (in this case, two bytes each). This means the first byte is just 209. a is the string consisting of just that first byte, so "\209". Hence, the answer to your question of
I just want to know which symbol would work for the if statement
is just "\209". On its own, this is invalid UTF-8: for it to be valid, a second byte in a certain range would need to follow. So if your terminal expects UTF-8 and thus can't decode this, it will display the Unicode replacement character to indicate the invalid encoding. Some terminals may try to be clever about guessing the encoding, displaying other characters instead.
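That the lone byte is not valid UTF-8 can be checked from Lua itself. A minimal sketch, assuming Lua 5.3 or later and a UTF-8 encoded source file:

```lua
local a = string.sub("слово", 1, 1)   -- just the lead byte "\209"

print(a == "\209")                    --> true

-- utf8.len returns nil plus the position of the first invalid byte
-- when the string is not valid UTF-8.
local len, bad_pos = utf8.len(a)
print(len, bad_pos)

-- The full word is valid: 5 characters stored in 10 bytes.
print(utf8.len("слово"), #"слово")
```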
The other conditions you tried didn't work because:
"?" is just the ASCII question mark, equivalent to "\63". Obviously, "\63" is not the same string as "\209", since the byte is different. The question mark isn't special in any way.
"�" is just the Unicode replacement character. This is what terminals often render when you give them an invalid byte sequence. It is not the byte that is actually in your string. Under the hood, it is encoded as "\239\191\189", which is not byte-equal to "\209".
"" is just the empty string. Being shorter than the one-byte string "\209", it can't be equal.
nil won't compare equal to any string, being of a different type.
In general, keep in mind that Lua strings are just byte strings, not strings of "codepoints" or "grapheme clusters". If you take a substring, you're indexing bytes. If you want to index codepoints instead, use the utf8 library, and ensure that your strings are UTF-8 encoded.
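The points above can be checked directly. The following is a sketch, assuming Lua 5.3 or later and a UTF-8 encoded source file; it runs the comparisons from the question and then walks the string by codepoint rather than by byte.

```lua
local s = "слово"
local a = s:sub(1, 1)                 -- byte indexing: only the first byte

-- None of the comparisons from the question match...
print(a == "?", a == "\239\191\189", a == "", a == nil)   --> false false false false
-- ...but the raw byte does.
print(a == "\209")                                        --> true

-- Codepoint indexing: utf8.codes yields the byte position and codepoint
-- of each character in turn.
local positions = {}
for pos, cp in utf8.codes(s) do
  positions[#positions + 1] = pos
  print(pos, cp, utf8.char(cp))
end
```

Each character starts two bytes after the previous one, so the loop reports positions 1, 3, 5, 7 and 9.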
Upvotes: 2
Reputation: 4390
Well, how the output of print() is displayed is outside of Lua's control, so it may differ depending on your terminal. I have a case where a program prints a 3-byte UTF-8 character correctly in Konsole, but the same program prints garbage in GNOME.
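One way to take the terminal out of the picture when debugging is to print the byte values rather than the bytes themselves, since decimal digits are plain ASCII. A small sketch, assuming a UTF-8 encoded source file:

```lua
local a = string.sub("слово", 1, 1)

-- string.byte(s, 1, -1) returns every byte value of s; collecting them in a
-- table and joining with spaces gives output that any terminal can show.
local one = table.concat({ string.byte(a, 1, -1) }, " ")
local all = table.concat({ string.byte("слово", 1, -1) }, " ")
print(one)   --> 209
print(all)   --> 209 129 208 187 208 190 208 178 208 190
```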
Upvotes: 1