Reputation: 93
Say I have a 90 megabyte file. It's not encrypted, but it is binary.
I want to store this file into a table as an array of byte values so I can process the file byte by byte.
I can spare up to 2 GB of RAM, so keeping track of which bytes have been processed, which bytes have yet to be processed, and the processed results would all be fine. I don't particularly care how long the processing takes.
How should I approach this?
Upvotes: 0
Views: 1082
Reputation: 43296
Note I've expanded and rewritten this answer due to Egor's comment.
You first need the file open in binary mode. The distinction is important on Windows, where the default text mode will change line endings from CR+LF into C newlines. You do this by specifying a mode argument of "rb" to io.open.
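As a minimal sketch of the binary-mode open described above: the scratch file, its contents, and the variable names are assumptions made only so the snippet runs on its own. It writes a file containing CR+LF pairs, reopens it with "rb", and shows that the bytes come back unchanged.

```lua
-- Create a small scratch file so the example is self-contained.
local path = os.tmpname()                 -- hypothetical scratch path
local out = assert(io.open(path, "wb"))   -- "wb": write the bytes exactly as given
out:write("line one\r\nline two\r\n")
out:close()

-- "rb" = read, binary: bytes come back exactly as stored on disk.
local file, err = io.open(path, "rb")
if not file then
  error("cannot open " .. path .. ": " .. tostring(err))
end
local data = file:read(64)                -- at most 64 bytes; the file holds 20
file:close()
os.remove(path)

print(#data)  -- 20
```

Checking the second return value of io.open, rather than wrapping the call in assert, is just one way to report the failure; either works.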
Although you can read a file one byte at a time, in practice you will want to work through the file in buffers. Those buffers can be fairly large, but unless you know you are handling only small files in a one-off script, you should avoid reading the entire file into a buffer with file:read"*a", since that will cause various problems with very large files.
Once you have a file open in binary mode, you read a chunk of it using buffer = file:read(n), where n is an integer count of bytes in the chunk. Using a moderately sized power of two will likely be the most efficient. The return value will either be nil (at end of file), or will be a string of up to n bytes. If it is less than n bytes long, that was the last buffer in the file. (If reading from a socket, pipe, or terminal, however, a read of fewer than n bytes may only indicate that no more data has arrived yet, depending on lots of other factors too complex to explain in this sentence.)
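The read loop described above could be wrapped up roughly as follows. This is a sketch, not a definitive implementation: each_chunk is a hypothetical helper name, and the 10-byte scratch file and 4-byte chunk size are chosen only to make the short final chunk visible.

```lua
-- Hypothetical helper: calls fn(chunk) for each chunk of at most `size` bytes.
local function each_chunk(path, size, fn)
  local file = assert(io.open(path, "rb"))
  while true do
    local chunk = file:read(size)
    if not chunk then break end      -- nil means end of file
    fn(chunk)
  end
  file:close()
end

-- Demo on a 10-byte scratch file, read 4 bytes at a time.
local path = os.tmpname()
local out = assert(io.open(path, "wb"))
out:write("0123456789")
out:close()

local sizes = {}
each_chunk(path, 4, function(chunk) sizes[#sizes + 1] = #chunk end)
os.remove(path)

print(table.concat(sizes, " "))  -- 4 4 2
```

For a real 90 MB file a much larger chunk size would be the natural choice; the tiny value here only keeps the demo readable.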
The string in buffer can be processed any number of ways. As long as #buffer is not too big, then {buffer:byte(1,-1)} will return an array of integer byte values for each byte in the buffer. Too big partly depends on how your copy of Lua was configured when it was built, and may depend on other factors such as available memory as well. #buffer > 1E6 is certainly too big. In the example that follows, I used buffer:byte(i) to access each byte one at a time. That works for any size of buffer, at least as long as i remains an integer.
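If the goal is a table of byte values, as in the question, one way to stay under the size limit mentioned above is to convert the string a slice at a time. The sketch below assumes a hypothetical helper append_bytes and a slice size of 4096, which is an arbitrary choice intended to stay small enough for typical builds.

```lua
-- Hypothetical helper: append the byte values of string s to table t,
-- converting at most `step` bytes per string.byte call.
local function append_bytes(t, s, step)
  step = step or 4096                -- assumed slice size, not a documented limit
  local n = #t
  for i = 1, #s, step do
    local j = math.min(i + step - 1, #s)
    local slice = { s:byte(i, j) }   -- small enough to expand safely
    for k = 1, #slice do
      n = n + 1
      t[n] = slice[k]
    end
  end
  return t
end

-- 100000 bytes alternating 0 and 255.
local bytes = append_bytes({}, string.rep("\0\255", 50000))
print(#bytes, bytes[1], bytes[2])  -- 100000  0  255
```

Calling append_bytes once per buffer read from the file would build up the whole table. Note that a table of numbers costs far more memory than the string it came from (on typical 64-bit builds of stock Lua, around 16 bytes per element is a reasonable assumption), so a 90 MB file could need well over 1 GB.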
Finally, don't forget to close the file.
Here's a complete example, lightly tested. It reads a file a buffer at a time, and accumulates the total size and the sum of all bytes. It then prints the size, sum, and average byte value.
-- sum all bytes in a file
local name = ...
assert(name, "Usage: "..arg[0].." filename")
local file = assert(io.open(name, "rb"))
local sum, len = 0,0
repeat
  local buffer = file:read(1024)
  if buffer then
    len = len + #buffer
    for i = 1, #buffer do
      sum = sum + buffer:byte(i)
    end
  end
until not buffer
file:close()
print("length:",len)
print("sum:",sum)
print("mean:", sum / len)
Run with Lua 5.1.4 on my Windows box using the example as its input, it reports:
length: 402
sum: 30374
mean: 75.557213930348
Upvotes: 1
Reputation: 72312
To split the contents of a string s into an array of bytes use {s:byte(1,-1)}.
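A quick illustration of that expression on a short string, with the reverse conversion added as a sanity check. The unpack alias is there only because the function lives in different places in different Lua versions.

```lua
local unpack = table.unpack or unpack   -- Lua 5.2+ moved unpack into table

local s = "Lua"
local t = { s:byte(1, -1) }             -- one integer per byte of s
print(t[1], t[2], t[3])                 -- 76  117  97
print(string.char(unpack(t)) == s)      -- true
```

This is fine for short strings; for very long ones the expansion can hit the interpreter's limit on the number of values returned at once.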
Upvotes: 0