Donnie

Reputation: 153

Efficient Way to Read File Byte by Byte and Convert to Int Array?

I'm trying to read in a PDF source file, and append each individual byte to an array of 8-bit integers. This is the slowest function in my program, and I was wondering if there was a more efficient method to do so, or if my process is completely wrong and I should go about it another way. Thanks in advance!

pagesize = 4096
arr = []
doc = ""
with open(filename, 'rb') as f:
    while doc != b'':
        doc = f.read(pagesize)
        for b in doc:
            arr.append(b)

Upvotes: 1

Views: 2451

Answers (1)

abarnert

Reputation: 365905

A bytes object is already a sequence of 8-bit integers:

>>> b = b'abc'
>>> for byte in b: print(byte)
97
98
99

If you want to convert it to a different kind of sequence, like a list, just call the constructor:

>>> lst = list(b)
>>> lst
[97, 98, 99]
>>> import array
>>> arr = array.array('b', b)
>>> arr
array('b', [97, 98, 99])

Or, if you need to do it a chunk at a time for some reason, just pass the whole chunk to extend:

>>> arr = list(b'abc')
>>> arr.extend(b'def')
>>> arr
[97, 98, 99, 100, 101, 102]

However, the most efficient thing to do is just leave it in a bytes:

with open(filename, 'rb') as f:
    arr = f.read()

… or, if you need it to be mutable, use a bytearray [1]:

pagesize = 4096
arr = bytearray()
with open(filename, 'rb') as f:
    while True:
        page = f.read(pagesize)
        if not page:
            break
        arr.extend(page)

… or, if there's any chance you could benefit from speeding up elementwise operations over the whole array, use NumPy:

import numpy as np

with open(filename, 'rb') as f:
    arr = np.fromfile(f, dtype=np.uint8)

Or, don't even read the file in the first place; instead, mmap it and use the mmap object itself as your sequence of integers [2]:

import mmap

with open(filename, 'rb') as f:
    arr = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

For comparison, under the covers (at least in CPython):

  • A bytes (or bytearray, or array.array('b'), or np.array(dtype=np.int8), etc.) is stored as an array of 8-bit integers. So, 1M bytes takes 1MB.
    • A bytearray will have a bit of extra slack at the end, increasing the size by about 6%. So, 1M bytes takes 1.06MB.
  • A general-purpose sequence like a tuple or list is stored as an array of pointers to objects wrapping the 8-bit integers. The objects don't matter (there's only going to be one copy for each of the 256 values, no matter how many references there are to each), but the pointers are 8 bytes (4 bytes in 32-bit builds). So, 1M bytes takes 8MB.
    • A list has the same extra slack as bytearray, so it's 8.48MB.
  • A mmap is like a bytes or array.array('b') as far as virtual memory goes, but any pages that you haven't read or written may not be mapped into physical memory at all. So, 1M bytes takes at most 1MB, but could take as little as 8KB.
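The size differences above are easy to spot-check with sys.getsizeof (the exact numbers vary by Python build; this is just a quick sketch):

```python
import sys

data = b'\x00' * 1_000_000            # 1M bytes

print(sys.getsizeof(data))            # ~1MB: raw 8-bit storage plus a small object header
print(sys.getsizeof(list(data)))      # ~8MB on a 64-bit build: one pointer per byte
```
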

1. You can speed this up. If you pre-extend the bytearray 4K at a time—or, even better, pre-allocate the whole thing, if you know the length of the file—you can readinto a memoryview over a slice of the bytearray. But this is more complicated, and probably not worth it—if you need this, you should probably have been using either numpy or an mmap.
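The pre-allocate-and-readinto idea from footnote 1 might look like this (a sketch only; the helper name and structure are my own):

```python
import os

def read_into_bytearray(filename, pagesize=4096):
    """Pre-allocate a bytearray the size of the file, then fill it
    page by page with readinto, avoiding per-chunk temporary copies."""
    size = os.path.getsize(filename)
    arr = bytearray(size)
    view = memoryview(arr)            # zero-copy, writable window over the buffer
    with open(filename, 'rb') as f:
        pos = 0
        while pos < size:
            n = f.readinto(view[pos:pos + pagesize])
            if not n:                 # unexpected EOF (file shrank?)
                break
            pos += n
    return arr
```

As the footnote says, this is more code for a modest win; if you need this level of control, numpy or mmap is usually the better tool.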

2. This does mean you have to move all your arr-using code inside the with, or otherwise keep the file open as long as you need the data, because the file itself is the storage for your "array"; you haven't copied the bytes into separate storage in memory.

Upvotes: 7
