Consler

Reputation: 85

What is the "undetected" symbol in Lua?

a = string.sub("слово", 1, 1)

слово is a Russian word in which every letter needs 2 bytes, and the output of print(a) is going to be something like �, depending on where you run it.

if a == "?" then

if a == "�" then

if a == "" then

if a == nil then

None of these work.

I just want to know which symbol would work for the if statement

Upvotes: 1

Views: 70

Answers (4)

ESkri

Reputation: 1928

To select the n-th character of a UTF-8 string, use utf8.offset (Lua 5.3+ is required):

local s = "слово"
local n = 4
local c = s:match(utf8.charpattern, utf8.offset(s,n))
print(c)  -->  в
if c == "в" then
    -- ...
end
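The same idea can be wrapped in a small function. This is only a sketch: utf8_char_at is a hypothetical helper name (it is not part of the utf8 library), and it assumes Lua 5.3+, a valid UTF-8 string, and n >= 1.

```lua
-- Hypothetical helper (not part of the utf8 library): returns the n-th
-- UTF-8 character of s, or nil if s has fewer than n characters.
-- Assumes Lua 5.3+, valid UTF-8 input and n >= 1.
local function utf8_char_at(s, n)
    local start = utf8.offset(s, n)
    if not start or start > #s then
        return nil
    end
    -- utf8.offset(s, n + 1) is the byte where the next character starts
    -- (or #s + 1 when the n-th character is the last one).
    local next_start = utf8.offset(s, n + 1)
    return s:sub(start, next_start - 1)
end

print(utf8_char_at("слово", 1))  --> с
print(utf8_char_at("слово", 4))  --> в
print(utf8_char_at("слово", 6))  --> nil
```

Using string.sub with byte offsets instead of a pattern match keeps the boundaries explicit, which may be easier to follow when debugging.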

Upvotes: 1

Nifim

Reputation: 5021

The character с is made of 2 bytes whose values are 209 and 129, so a = string.sub("слово", 1,1) will get only the first of those 2 bytes. That means the value you should be checking for is the byte 209, and you can do this in Lua by writing a decimal escape in a string literal: a backslash followed by the byte value.

a = string.sub("слово", 1,1)
if a == '\209' then
    print('Do stuff')
end
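An equivalent check compares the numeric byte value instead of an escaped string. The sketch below assumes the source file is saved as UTF-8; string.byte is standard, and the hexadecimal escape form needs Lua 5.2 or later.

```lua
local a = string.sub("слово", 1, 1)

-- string.byte returns the numeric value of the byte (209 here,
-- assuming the source file is saved as UTF-8).
if a:byte() == 209 then
    print("first byte is 209")
end

-- The same byte written as a hexadecimal escape (Lua 5.2+).
print(a == "\xD1")  --> true
```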

Upvotes: 1

Luatic

Reputation: 11201

Assuming your Lua source file is UTF-8 encoded, "слово" will be equivalent to "\209\129\208\187\208\190\208\178\208\190" (you can confirm this via ("слово"):gsub(".", function(c) return ("\\%d"):format(c:byte()) end)).

That is, as you observed, the Cyrillic characters, not being part of ASCII, are encoded in UTF-8 as multi-byte sequences (in this case, specifically two-byte sequences). The first byte of the string is therefore just 209. a is the string consisting of just that first byte, so "\209". Hence, the answer to your question of

I just want to know which symbol would work for the if statement

is just "\209". On its own, this is invalid UTF-8: For it to be valid UTF-8, a second byte in a certain range would need to follow. So if your terminal expects UTF-8 and thus can't decode this, it will display the Unicode replacement character to indicate the invalid encoding. Some terminals may try to be clever about guessing the encoding, displaying other characters instead.
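One way to see that a lone "\209" is not valid UTF-8 is to pass it to utf8.len (Lua 5.3+), which reports the position of the first invalid byte instead of a length. A minimal sketch; the variable names are only illustrative:

```lua
-- utf8.len returns the number of characters for valid UTF-8, or nil
-- plus the position of the first invalid byte otherwise.
local ok_len = utf8.len("\209\129")        -- 1: a complete "с"
local bad_len, bad_pos = utf8.len("\209")  -- nil and 1: byte 1 is invalid on its own

print(ok_len, bad_len, bad_pos)
```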

The other conditions you tried didn't work because:

  • "?" is just the ASCII question mark, equivalent to "\63". Obviously, "\63" is not the same string as "\209", since the byte is different. The question mark isn't special in any way.
  • "�" is just the Unicode replacement character. This is what terminals often render when you give them a byte sequence they cannot decode. It is not the byte your string actually contains. Under the hood, it is encoded as "\239\191\189", which is not byte equal to "\209".
  • "" is just the empty string. Being shorter than the one-byte string "\209", it can't be equal.
  • nil won't compare equal to any string, being of a different type.
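The comparisons listed above can be checked directly. This sketch assumes the source file is saved as UTF-8, so that the literal "�" is stored as three bytes:

```lua
local a = string.sub("слово", 1, 1)

print(a == "?")     --> false
print(a == "�")     --> false
print(a == "")      --> false
print(a == nil)     --> false
print(a == "\209")  --> true

-- Byte lengths: a is 1 byte, the replacement character is 3 bytes.
print(#a, #"�")
```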

In general, keep in mind that Lua strings are just byte strings, not strings of "codepoints" or "grapheme clusters". If you substring, you're indexing bytes. If you want to index codepoints instead, use the utf8 library, and ensure that your strings are UTF-8 encoded.
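As a rough illustration of the difference between bytes and codepoints, the sketch below uses only the standard utf8 library (Lua 5.3+) and assumes a UTF-8 encoded source file:

```lua
local s = "слово"

print(#s)           --> 10 (bytes)
print(utf8.len(s))  --> 5  (codepoints)

-- utf8.codes yields the byte position and codepoint of each character.
for pos, cp in utf8.codes(s) do
    print(pos, cp, utf8.char(cp))
end
```

The loop visits byte positions 1, 3, 5, 7 and 9, one per two-byte character, which is why string.sub(s, 1, 1) cuts a character in half.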

Upvotes: 2

Gianni

Reputation: 4390

Well, how the output of print() is displayed is outside of Lua's control, so it might be different depending on your terminal. I have a case where a program prints a 3-byte UTF-8 character fine in Konsole, but the same program prints garbage in GNOME.

Upvotes: 1
