Reputation: 153
I'm trying to read in a PDF source file, and append each individual byte to an array of 8-bit integers. This is the slowest function in my program, and I was wondering if there was a more efficient method to do so, or if my process is completely wrong and I should go about it another way. Thanks in advance!
pagesize = 4096
arr = []
doc = ""
with open(filename, 'rb') as f:
    while doc != b'':
        doc = f.read(pagesize)
        for b in doc:
            arr.append(b)
Upvotes: 1
Views: 2451
Reputation: 365905
A bytes object is already a sequence of 8-bit integers:
>>> b = b'abc'
>>> for byte in b: print(byte)
97
98
99
If you want to convert it to a different kind of sequence, like a list or an array.array, just call the constructor:
>>> lst = list(b)
>>> lst
[97, 98, 99]
>>> import array
>>> arr = array.array('b', b)
>>> arr
array('b', [97, 98, 99])
Or, if you need to do it a chunk at a time for some reason, just pass the whole chunk to extend:
>>> arr = list(b'abc')
>>> arr.extend(b'def')
>>> arr
[97, 98, 99, 100, 101, 102]
However, the most efficient thing to do is just leave it in a bytes:
with open(filename, 'rb') as f:
arr = f.read()
… or, if you need it to be mutable, use bytearray:¹
pagesize = 4096
arr = bytearray()
with open(filename, 'rb') as f:
    while True:
        chunk = f.read(pagesize)
        if not chunk:
            break
        arr.extend(chunk)
… or, if there's any chance you could benefit from speeding up elementwise operations over the whole array, use NumPy:
import numpy as np

with open(filename, 'rb') as f:
    arr = np.fromfile(f, dtype=np.uint8)
Or, don't even read the file in the first place and instead mmap it, then use the mmap as your sequence of integers:²
import mmap

with open(filename, 'rb') as f:
    arr = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
For comparison, under the covers (at least in CPython):

- bytes (or bytearray, or array.array('b'), or np.array(dtype=np.int8), etc.) is stored as an array of 8-bit integers. So, 1M bytes takes 1MB.
- bytearray will have a bit of extra slack at the end, increasing the size by about 6%. So, 1M bytes takes 1.06MB.
- tuple or list is stored as an array of pointers to objects wrapping the 8-bit integers. The objects don't matter (there's only going to be one copy for each of the 256 values, no matter how many references there are to each), but the pointers are 8 bytes apiece (4 bytes in 32-bit builds). So, 1M bytes takes 8MB.
- list has the same extra slack as bytearray, so it's 8.48MB.
- mmap is like a bytes or array.array('b') as far as virtual memory goes, but any pages that you haven't read or written may not be mapped into physical memory at all. So, 1M bytes takes at most 1MB, but could take as little as 8KB.

1. You can speed this up. If you pre-extend the bytearray 4K at a time (or, even better, pre-allocate the whole thing, if you know the length of the file), you can readinto a memoryview over a slice of the bytearray. But this is more complicated, and probably not worth it; if you need this, you should probably have been using either NumPy or an mmap.
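As a rough sketch of what that footnote describes (the helper name and the 4K chunk size are my own, and it assumes os.path.getsize reports the file's final length):

```python
import os

def read_into_bytearray(filename):
    """Pre-allocate a bytearray of the file's size, then fill it in place."""
    size = os.path.getsize(filename)
    arr = bytearray(size)
    view = memoryview(arr)
    offset = 0
    with open(filename, 'rb') as f:
        while offset < size:
            # readinto fills the slice directly, with no intermediate
            # bytes object to allocate and copy
            n = f.readinto(view[offset:offset + 4096])
            if n == 0:
                break
            offset += n
    return arr
```
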
2. This does mean that you have to move all your arr-using code inside the with, or otherwise keep the file open as long as you need the data, because the file itself is the storage for your "array"; you haven't copied the bytes into different storage in memory.
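For instance, a minimal sketch of keeping all the mmap-using code inside the with (the helper name is made up, and note that mmap-ing an empty file raises ValueError):

```python
import mmap

def first_and_last(filename):
    """Return the first and last byte values of a non-empty file via mmap."""
    with open(filename, 'rb') as f:
        # length 0 means "map the whole file"; ACCESS_READ keeps it read-only
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # indexing a mmap yields ints, just like indexing a bytes
            return m[0], m[len(m) - 1]
```
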
Upvotes: 7