user9264697

Fastest method in Python to detect zero-filled bytearray

I have code that reads a binary disk image file in 2MB chunks and saves each chunk as a separate file.

My only special requirement is to skip saving a chunk if it contains only zeroes, for the sake of speed and efficiency. I am worried that my current method, which uses .count(), may not be the most efficient:

with open("source.img", "rb") as src:
  for addr in range(0, sourcesize, chunksize):
    buf = src.read(chunksize)
    with open("imgdir/"+hex(addr), "wb") as dest:
      if len(buf) > buf.count(b"\x00"): # <---this concerns me
        dest.write(buf)

The performance in practice is lackluster. I know Python is not designed for speed, but does it offer any better options? Perhaps a function that finds "anything except \x00" in the buffer, which should return much earlier on average, with fewer iterations?

Upvotes: 1

Views: 1963

Answers (1)

user9264697

In the following test loop, I was able to reduce execution time by about 25% by comparing the work buffer directly against a zero-filled buffer of the same size. I chose this approach because the comparison can stop at the first byte that differs, so for chunks containing data Python usually does not have to scan to the end of the buffer, whereas .count() always scans the whole chunk:

sourcesize = 2**31  # 2GB
chunksize = 2**21  # 2MB
zeros = bytes(chunksize)  # all-zero reference buffer, created once

with open("source.img", "rb") as source:
  for addr in range(0, sourcesize, chunksize):
    with open("/dev/null", "wb") as dest:
      buf = source.read(chunksize)
      #if len(buf) > buf.count(b"\x00"): # old comparison
      if buf != zeros: # <-faster comparison
        dest.write(buf)

This test loop gets nearly identical timing to the test command dd if=source.img of=/dev/null bs=2M conv=sparse, which behaves very similarly, including a check to skip blocks that are all zeros. Since I assume dd is written in C, I consider this a good result.
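One caveat with the direct comparison: if the image size is not a multiple of the chunk size, the final chunk is shorter than the zero buffer, so buf != zeros is true even when that chunk holds only zeros. Below is a minimal sketch of how the check could be wrapped to handle that case. The names is_all_zero and split_image are hypothetical helpers introduced for illustration, and opening the output file only for non-zero chunks is my assumption about the intent of the question, not something measured in the test above.

```python
import os

CHUNK_SIZE = 2**21          # 2MB, same as the test loop
ZEROS = bytes(CHUNK_SIZE)   # all-zero reference buffer, created once


def is_all_zero(buf, zeros=ZEROS):
    """Return True if buf is empty or contains only zero bytes (hypothetical helper)."""
    if len(buf) == len(zeros):
        # Common case: full-size chunk, compared directly against the reference buffer.
        return buf == zeros
    # Short (typically final) chunk: compare against a zero buffer of matching length.
    return buf == bytes(len(buf))


def split_image(src_path, out_dir, chunk_size=CHUNK_SIZE):
    """Write each non-zero chunk to out_dir/<hex offset>; return the number of files written."""
    zeros = bytes(chunk_size)
    written = 0
    addr = 0
    with open(src_path, "rb") as src:
        while True:
            buf = src.read(chunk_size)
            if not buf:  # end of file
                break
            if not is_all_zero(buf, zeros):
                # Assumption: the output file is only created when there is data to save.
                with open(os.path.join(out_dir, hex(addr)), "wb") as dest:
                    dest.write(buf)
                written += 1
            addr += len(buf)
    return written
```

Calling split_image("source.img", "imgdir") would then mirror the loop from the question, except that all-zero chunks leave no file behind. I have not timed this variant; the extra length check should be negligible next to the comparison itself, but that is an expectation rather than a measurement.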

Upvotes: 3
