How to read a binary into an array

Question

Say I have a 90 megabyte file. It's not encrypted, but it is binary.

I want to store this file into a table as an array of byte values so I can process the file byte by byte.

I can spare up to 2 GB of RAM, so an approach that keeps track of which bytes have been processed, which bytes have yet to be processed, and the processed bytes themselves would be fine. I don't particularly care how long the processing takes.

How should I approach this?

RBerteig · Accepted Answer

Note I've expanded and rewritten this answer due to Egor's comment.

You first need the file open in binary mode. The distinction is important on Windows, where the default text mode will change line endings from CR+LF into C newlines. You do this by specifying a mode argument to io.open of "rb".
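As a minimal sketch of that first step, the snippet below wraps io.open in a small helper. The name open_binary and its error message are my own choices for illustration, not anything the standard library provides; only io.open and the "rb" mode come from Lua itself.

```lua
-- Hypothetical helper (not part of Lua): open a file for binary reading.
-- Returns the file handle, or nil plus an error message on failure.
local function open_binary(path)
    -- "r" = read, "b" = binary, so no newline translation happens on Windows
    local f, err = io.open(path, "rb")
    if not f then
        return nil, "cannot open " .. tostring(path) .. ": " .. tostring(err)
    end
    return f
end
```

On platforms that don't distinguish text from binary files the "b" is harmless, so it is reasonable to always include it when the data is not text.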

Although you can read a file one byte at a time, in practice you will want to work through the file in buffers. Those buffers can be fairly large, but unless you know you are handling only small files in a one-off script, you should avoid reading the entire file into a buffer with file:read"*a" since that will cause various problems with very large files.

Once you have a file open in binary mode, you read a chunk of it using buffer = file:read(n), where n is an integer count of bytes in the chunk. Using a moderately sized power of two will likely be the most efficient. The return value will either be nil, or will be a string of up to n bytes. If it is fewer than n bytes long, that was the last buffer in the file. (If reading from a socket, pipe, or terminal, however, a read of fewer than n bytes may only indicate that no more data has arrived yet, depending on lots of other factors too complex to explain in this sentence.)
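One way to package the read-a-chunk loop is as an iterator for a generic for loop. This is only a sketch: chunks and count_bytes are hypothetical names I made up, and the 4096-byte default is an arbitrary assumption rather than a tuned value.

```lua
-- Hypothetical iterator: yields successive chunks of at most `size` bytes.
-- file:read(size) returns nil at end of file, which ends the for loop.
local function chunks(file, size)
    size = size or 4096  -- assumed default, pick whatever suits your data
    return function()
        return file:read(size)
    end
end

-- Hypothetical example use: count the bytes and chunks in a file.
local function count_bytes(path, size)
    local f = assert(io.open(path, "rb"))
    local total, nchunks = 0, 0
    for buffer in chunks(f, size) do
        total = total + #buffer
        nchunks = nchunks + 1
    end
    f:close()
    return total, nchunks
end
```

Only one chunk is held in memory at a time, so the size of the file does not matter much; the chunk size mostly trades the number of read calls against the size of each string.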

The string in buffer can be processed in any number of ways. As long as #buffer is not too big, {buffer:byte(1,-1)} will give you an array of integer byte values, one for each byte in the buffer. What counts as too big depends partly on how your copy of Lua was configured when it was built, and may depend on other factors such as available memory as well. #buffer > 1E6 is certainly too big. In the example that follows, I used buffer:byte(i) to access each byte one at a time. That works for any size of buffer, at least as long as i remains an integer.
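If you really do want the byte values in a table, one way to stay under the "too big" limit is to convert the buffer a slice at a time. The helper below is a sketch under that assumption: append_bytes is a hypothetical name, and the default slice of 4096 bytes is a guess at a comfortably safe value, not a documented limit.

```lua
-- Hypothetical helper: append the byte values of `buffer` to the array `out`.
-- string.byte is called on slices of at most `step` bytes so that no single
-- call has to return a huge number of values.
local function append_bytes(out, buffer, step)
    step = step or 4096  -- assumed safe slice size
    local n = #out
    for first = 1, #buffer, step do
        local last = math.min(first + step - 1, #buffer)
        local part = { buffer:byte(first, last) }
        for j = 1, #part do
            n = n + 1
            out[n] = part[j]
        end
    end
    return out
end
```

Calling append_bytes(bytes, buffer) on each buffer as it is read builds up one array for the whole file. Keep in mind that a table slot is many times larger than a single byte, so an array holding every byte of a large file costs far more memory than the file itself.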

Finally, don't forget to close the file.

Here's a complete example, lightly tested. It reads a file a buffer at a time, and accumulates the total size and the sum of all bytes. It then prints the size, sum, and average byte value.

-- sum all bytes in a file
local name = ...
assert(name, "Usage: "..arg[0].." filename")

file = assert(io.open(name, "rb"))
local sum, len = 0,0
repeat
    local buffer = file:read(1024)
    if buffer then
        len = len + #buffer
        for i = 1, #buffer do
            sum = sum + buffer:byte(i)
        end
    end
until not buffer
file:close()
print("length:",len)
print("sum:",sum)
print("mean:", sum / len)

Run with Lua 5.1.4 on my Windows box using the example as its input, it reports:

length: 402
sum:    30374
mean:   75.557213930348
