Reputation: 143
Is it possible to read one UTF-8 character from file?
file:read(1) returns weird characters instead when I print the result.
function firstLetter(str)
    return str:match("[%z\1-\127\194-\244][\128-\191]*")
end
This function returns the first UTF-8 character of the string str. I need to read one UTF-8 character the same way, but from an input file, without loading the whole file into memory via file:read("*all").
This question is similar to this post: Extract the first letter of a UTF-8 string with Lua
Upvotes: 5
Views: 2941
Reputation: 69924
In the UTF-8 encoding, the number of bytes taken by a character is determined by the first byte of that character, according to the following table (taken from RFC 3629):
Char. number range | UTF-8 octet sequence
(hexadecimal) | (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
If the highest bit of the first byte is "0", then the character has only one byte. If the highest bits are "110" then the character has 2 bytes and so on.
What you can then do is read one byte from the file and determine how many continuation bytes you need to read to get the full UTF-8 character:
function get_one_utf8_character(file)
    local c1 = file:read(1)
    if not c1 then return nil end  -- end of file

    -- Classify the leading byte by value. Comparing numbers avoids putting
    -- "\000" inside a pattern, which Lua 5.1 does not allow.
    local b = c1:byte()
    local ncont
    if     b <= 127 then ncont = 0               -- 0xxxxxxx
    elseif b >= 192 and b <= 223 then ncont = 1  -- 110xxxxx
    elseif b >= 224 and b <= 239 then ncont = 2  -- 1110xxxx
    elseif b >= 240 and b <= 247 then ncont = 3  -- 11110xxx
    else
        return nil, "invalid leading byte"
    end

    local bytes = { c1 }
    for i = 1, ncont do
        local ci = file:read(1)
        if not (ci and ci:match("[\128-\191]")) then
            return nil, "expected continuation byte"
        end
        bytes[#bytes+1] = ci
    end
    return table.concat(bytes)
end
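To process a whole file this way, a reader like the one above can be wrapped in an iterator. The following is only a sketch: utf8_chars is a name introduced here, and it assumes the reader returns a string per character, nil at end of file, and nil plus a message on malformed input.

```lua
-- Hypothetical helper: wraps a "read one character" function in an iterator
-- suitable for a generic for loop.
-- Assumed reader contract: returns a string, or nil at end of file,
-- or nil plus an error message on malformed input.
function utf8_chars(file, reader)
    return function()
        local ch, err = reader(file)
        if err then error(err) end  -- surface malformed input to the caller
        return ch                   -- nil ends the for loop
    end
end
```

It could then be used as `for ch in utf8_chars(io.stdin, get_one_utf8_character) do io.write(ch, "\n") end`. Raising an error on malformed input is one design choice; returning the message to the caller would work just as well.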
Upvotes: 0
Reputation: 26744
You need to keep reading so that the string you are matching always holds at least four bytes, the maximum length of a UTF-8 character (which will allow you to apply the logic from the answer you referenced). If the length is len after matching and removing a UTF-8 character, you then read 4-len more bytes from the file.
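A minimal sketch of that buffering idea, assuming well-formed input. The name utf8_reader is made up for illustration, and the pattern is a variant of the one in the question, written as a negated class so that it contains no zero byte and no %z:

```lua
-- Lead byte (anything except 0x80-0xC1 and 0xF5-0xFF) followed by its
-- continuation bytes, anchored at the start of the buffer.
local UTF8_CHAR = "^[^\128-\193\245-\255][\128-\191]*"

-- Hypothetical helper: returns an iterator that yields one UTF-8
-- character per call while keeping at most 4 bytes in memory.
function utf8_reader(file)
    local buf = ""
    return function()
        -- Top the buffer up to 4 bytes, the longest UTF-8 sequence.
        if #buf < 4 then
            buf = buf .. (file:read(4 - #buf) or "")
        end
        if buf == "" then return nil end  -- end of file
        -- If the buffer starts with an invalid lead byte, hand back that
        -- single raw byte instead of failing.
        local ch = buf:match(UTF8_CHAR) or buf:sub(1, 1)
        buf = buf:sub(#ch + 1)
        return ch
    end
end
```

Because the continuation part of the pattern is greedy, stray continuation bytes in malformed input are attached to the preceding character, so this is not a validator.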
ZeroBrane Studio replaces invalid UTF-8 characters with a [SYN] character when printed to the Output panel (as you see in the screenshot). This blogpost describes the logic behind the detection of invalid UTF-8 characters (in Lua) and their handling in ZeroBrane Studio.
Upvotes: 0
Reputation: 23727
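function read_utf8_char(file)
    local c1 = file:read(1)
    if not c1 then return nil end  -- end of file
    -- Each pass drops the top bit and shifts left, so the loop counts the
    -- leading 1 bits of the first byte; ASCII bytes are treated as 128.
    local ctr, c = -1, math.max(c1:byte(), 128)
    repeat
        ctr = ctr + 1
        c = (c - 128)*2
    until c < 128
    -- ctr is now the number of continuation bytes to read.
    if ctr == 0 then return c1 end  -- file:read(0) returns nil at end of file
    return c1..(file:read(ctr) or "")  -- tolerate a truncated final character
end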
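The arithmetic in the loop above is easier to follow in isolation. This sketch pulls it out into a function (continuation_count is a name introduced here, not part of the answer) that takes the numeric value of the first byte:

```lua
-- Hypothetical helper isolating the counting loop.
-- Returns how many continuation bytes follow a first byte with value `byte`.
function continuation_count(byte)
    -- Treat ASCII (top bit clear) as 128 so the loop runs exactly once.
    local ctr, c = -1, math.max(byte, 128)
    repeat
        ctr = ctr + 1
        c = (c - 128) * 2  -- drop the top bit, then shift left by one
    until c < 128          -- stop at the first 0 bit
    return ctr
end

print(continuation_count(0x41))  -- 0xxxxxxx: "A"
print(continuation_count(0xC3))  -- 110xxxxx: first byte of a 2-byte character
print(continuation_count(0xE2))  -- 1110xxxx: first byte of a 3-byte character
print(continuation_count(0xF0))  -- 11110xxx: first byte of a 4-byte character
```

A stray continuation byte (0x80 to 0xBF) also yields 0 here, so this approach does no validation of the input.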
Upvotes: 3