Reputation:
I have code that reads a binary disk image file in 2MB chunks and saves each chunk as separate files.
My only special requirement is to skip saving a chunk if it contains all zeroes; this is purely for the sake of speed and efficiency. I am worried that my current method, which uses .count(), may not be the most efficient:
    with open("source.img", "rb") as src:
        for addr in range(0, sourcesize, chunksize):
            buf = src.read(chunksize)
            with open("imgdir/" + hex(addr), "wb") as dest:
                if len(buf) > buf.count(b"\x00"):  # <--- this concerns me
                    dest.write(buf)
The performance in practice is lackluster. I know Python is not designed for speed, but does it offer any better options? Perhaps a function that finds "anything except \x00" in the buffer, which should return much earlier on average, with fewer iterations?
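For reference, here is a self-contained sketch of the approach described above. The paths and chunk layout are illustrative: the script first builds a tiny two-chunk demo image so it can run anywhere. It also differs from the loop above in that it opens the destination file only when a chunk is actually written, so all-zero chunks leave no empty files behind:

```python
# Self-contained demo of the chunk-splitting loop (demo image is
# two 2 MB chunks: one all zeros, one with a single non-zero byte).
import os

chunksize = 2**21  # 2 MB
src_path = "source.img"

# Build the demo image: chunk 0 is all zeros, chunk 1 has one set byte.
with open(src_path, "wb") as f:
    f.write(bytes(chunksize))
    nonzero = bytearray(chunksize)
    nonzero[123] = 0xFF
    f.write(nonzero)

sourcesize = os.path.getsize(src_path)
os.makedirs("imgdir", exist_ok=True)

with open(src_path, "rb") as src:
    for addr in range(0, sourcesize, chunksize):
        buf = src.read(chunksize)
        if len(buf) > buf.count(b"\x00"):  # at least one non-zero byte
            # Open the destination only when we actually write, so
            # all-zero chunks do not leave empty files behind.
            with open("imgdir/" + hex(addr), "wb") as dest:
                dest.write(buf)

print(sorted(os.listdir("imgdir")))  # only the non-zero chunk is saved
```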
Upvotes: 1
Views: 1963
Reputation:
In the following test loop I was able to reduce execution time by about 25% by comparing the work buffer directly against a preallocated zero buffer. I chose this approach because Python can stop the comparison at the first differing byte, well before the end of the buffer in most iterations:
    sourcesize = 2**31  # 2 GB
    chunksize = 2**21   # 2 MB
    zeros = bytes(chunksize)

    with open("source.img", "rb") as source:
        for addr in range(0, sourcesize, chunksize):
            with open("/dev/null", "wb") as dest:
                buf = source.read(chunksize)
                #if len(buf) > buf.count(b"\x00"):  # old comparison
                if buf != zeros:  # <- faster comparison
                    dest.write(buf)
This gets nearly identical timings to the test command

    dd if=source.img of=/dev/null bs=2M conv=sparse

which behaves very similarly, including a check that skips blocks that are all zeros. Since I assume dd is written in C, I consider this a good result.
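A quick way to see why the equality check wins is to time both checks on a chunk whose first non-zero byte sits near the start (buffer contents and iteration count below are illustrative): .count() must scan the full 2 MB every time, while the equality comparison can stop at the first differing byte.

```python
# Micro-benchmark sketch: .count() check vs. direct equality check.
import timeit

chunksize = 2**21        # 2 MB, as in the answer
zeros = bytes(chunksize) # reference all-zero buffer

# A "typical" non-zero chunk: one set byte near the start, so the
# equality check can bail out almost immediately.
buf = bytearray(chunksize)
buf[10] = 1
buf = bytes(buf)

t_count = timeit.timeit(lambda: len(buf) > buf.count(b"\x00"), number=200)
t_eq = timeit.timeit(lambda: buf != zeros, number=200)
print(f".count() check: {t_count:.4f}s   != zeros check: {t_eq:.4f}s")
```

Both checks classify the chunk identically; only the time spent differs. On an all-zero chunk, of course, both must examine every byte.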
Upvotes: 3