shodanshok

Reputation: 167

Fastest method to recognize an all-NULL string in Python

Suppose a performance-critical code section reads equally-sized data blocks from a disk file. How can I detect an all-null string/data block in the shortest possible time?

This is my current code:

# options.blocksize = 1024*1024
f, dummy = do_open(dev, 'r')
zeroblock = '\0'*options.blocksize
while True:
    block = f.read(options.blocksize)
    if not block:
        break
    if block == zeroblock:
        csum = "0000"

As you can see, I am comparing an all-null block against the ones read from the file. This method works but, for large blocks, it spends considerable time in the comparison.

I also tried counting NULL occurrences:

# options.blocksize = 1024*1024
f, dummy = do_open(dev, 'r')
zeroblock = '\0'*options.blocksize
while True:
    block = f.read(options.blocksize)
    if not block:
        break
    if block.count('\0') == options.blocksize:
        csum = "0000"

but it is even slower than the first method.

Any suggestions on how to improve performance? Thanks.

Upvotes: 0

Views: 134

Answers (1)

gilch

Reputation: 11681

Instead of if block == zeroblock: try if not sum(block):. Adding zeros together should be very fast.

if not any(block): should be about as fast and, for sufficiently large blocks, may even perform better. (It short-circuits on the first nonzero byte.)
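If you want to check the tradeoff on your own machine, a quick timeit sketch like this works (Python 3; the 1 MiB size and iteration count here are arbitrary choices, not from the question):

import timeit

zeros = bytes(1024 * 1024)                 # all-zero block: any() must scan everything
dirty = b'\x01' + bytes(1024 * 1024 - 1)   # nonzero first byte: any() exits immediately

for name, blk in (('zeros', zeros), ('dirty', dirty)):
    t_sum = timeit.timeit(lambda: not sum(blk), number=100)
    t_any = timeit.timeit(lambda: not any(blk), number=100)
    print('%s  sum: %.4fs  any: %.4fs' % (name, t_sum, t_any))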

Note that this doesn't work on normal Unicode strings, only on bytestrings (b''), because bytestring iterators return ints instead of 1-character strings. This means you have to open() the file in binary mode, with 'rb' instead of just 'r'.
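Put together, a minimal Python 3 sketch of the question's loop could look like this (do_open, dev and options.blocksize are the question's names, assumed to behave as shown there):

f, dummy = do_open(dev, 'rb')  # binary mode: read() returns bytes, not str
while True:
    block = f.read(options.blocksize)
    if not block:              # empty bytes object means EOF
        break
    if not any(block):         # iterating bytes yields ints in Python 3
        csum = "0000"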


Python 2 doesn't have bytes. The old str type it used instead was byte-based, but its iterator returns length-1 strings instead of the integers we want. So in Python 2 you'll want to use the array module instead. Either upgrade your Python, or try something like this.

from array import array

f, dummy = do_open(dev, 'rb')
while True:
    block = array('B')  # 'B' means bytes. (Actually "unsigned char" in C.)
    try:
        block.fromfile(f, options.blocksize)
    except EOFError:  # Fewer bytes were left than blocksize.
        pass  # Remaining bytes were still appended, though.
    if not block:
        break
    if not any(block):  # sum() might be faster depending on blocksize.
        csum = "0000"

You don't need the try/except part if you know the file divides into blocksize evenly.


You might also try array('L') to load the data as unsigned longs instead of as bytes. The array then contains fewer (bigger) elements, so sum or any would need roughly a quarter as many iterations, or an eighth where a C long is 8 bytes, but you'd have to make sure your blocksize is a multiple of the item size.
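A rough Python 2 sketch of that idea, reusing do_open and options.blocksize from the question and assuming both the blocksize and the file size are multiples of the platform's unsigned-long size:

from array import array

itemsize = array('L').itemsize             # typically 4 or 8, platform-dependent
assert options.blocksize % itemsize == 0   # fromfile() counts items, not bytes

f, dummy = do_open(dev, 'rb')
while True:
    block = array('L')
    try:
        block.fromfile(f, options.blocksize // itemsize)
    except EOFError:  # short read at the end of the file
        pass
    if not block:
        break
    if not any(block):  # each zero item stands for itemsize zero bytes
        csum = "0000"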

Upvotes: 2
