CrazySqueak

Reputation: 611

Is there a way to speed up reading and handling large (unstructured) binary files?

I'm writing a program that encrypts files, however the speed is horrendous, running at about 1.5s for every 1MiB block. Is there any way I can speed this up?

I'm using Python 3.x, and the current encryption method converts the data to base64 before encrypting it. Each file is (at the moment) split into 1 MiB blocks, which are written individually to the destination directory. self.ep refers to the unencrypted directory, self.sp_bp refers to the folder that each encrypted block is saved to, and mdata is the dictionary containing the metadata. I've tried increasing the block size, which had little effect, and the write = True logic was added so that a block is not rewritten when the data already on disk is identical, in an attempt to fix the problem.

for fn in files:
    print("File: {}".format(fn))
    fp = os.path.join(root, fn)
    # Encrypt the path relative to the unencrypted directory; the result is
    # used as the key for this file's block list in the metadata.
    rfp = self.getRelativePath(fp, self.ep)
    rfp = self.e.encryptString(rfp.encode("utf-8"), key)
    mdata["files"][rfp] = []
    with open(fp, "rb") as f:
        buf = f.read(self.BLOCKSIZE)
        while len(buf) > 0:
            blockno += 1
            mdata["totalblocks"] += 1
            print("  Block: {}".format(blockno))
            mdata["files"][rfp].append(blockno)

            # Encrypt the block (the routine converts to base64 internally).
            buf = self.e.encryptString(buf, key).encode("utf-8")

            # Record a SHA-512 digest of the encrypted block.
            hasher = hashlib.sha512()
            hasher.update(buf)
            blockhash = hasher.hexdigest()
            mdata["blockhashes"][blockno] = blockhash

            # Skip the write if the block on disk is already identical.
            write = True
            blockpath = os.path.join(self.sp_bp, "block{}".format(blockno))
            if os.path.exists(blockpath):
                with open(blockpath, "rb") as bf:
                    otherblk = bf.read()
                if buf == otherblk:
                    write = False

            if write:
                with open(blockpath, "wb") as bf:
                    bf.write(buf)
            buf = f.read(self.BLOCKSIZE)

As mentioned, the encryption runs at about 1.5 seconds per mebibyte (1024^2 bytes), which is far too slow when handling large files.

Edit: Some useful information. self.BLOCKSIZE is equal to 1024*1024 (1048576), the number of bytes in one MiB. The os.path.join(self.sp_bp, "block{}".format(blockno)) snippet converts the block number into a valid filename for storage in the 'vault'. blockno is the current block number, and self.sp_bp is the path to the folder that the encrypted blocks are stored in (the 'vault'). No temporary files are used, only the original (unencrypted) input files and the encrypted 'blocks'.
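For what it's worth, the write = True check could be made cheaper by comparing against the SHA-512 digest already recorded in mdata["blockhashes"], instead of reading the old block back from disk. A minimal sketch, assuming the previous run's metadata is loaded before it gets overwritten; the helper name and arguments are hypothetical:

import hashlib

# Sketch (hypothetical helper): decide whether a block needs rewriting by
# comparing digests rather than reading the existing block file in full.
# previous_hashes is assumed to be the blockhashes dict saved by the
# previous run, loaded before this run starts overwriting it.
def block_needs_write(previous_hashes, blockno, encrypted_buf):
    digest = hashlib.sha512(encrypted_buf).hexdigest()
    return previous_hashes.get(blockno) != digest

This trades one disk read per block for a dictionary lookup, so it only helps when many blocks are unchanged between runs.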

Upvotes: 0

Views: 173

Answers (1)

CrazySqueak

Reputation: 611

I've figured out the cause of the problem. The encryption routine runs a for loop over every character of the base64-encoded data. Pure-Python loops carry significant interpreter overhead per iteration, so running the loop body once for every single character (over a million iterations per 1 MiB block) was extremely time consuming.

I'm working on a way to speed up encryption by grouping characters to drastically lower the number of iterations.
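A minimal sketch of the grouping idea, not the actual cipher (encryptString isn't shown here), using XOR as a stand-in operation: processing multi-byte groups cuts the number of Python-level iterations by the group size.

import time

data = b"A" * (1024 * 1024)  # 1 MiB of sample input
key = 0x5A                   # stand-in single-byte key

# One Python-level iteration per byte: ~1,048,576 loop passes per MiB.
def encrypt_per_char(data, key):
    out = bytearray()
    for ch in data:
        out.append(ch ^ key)
    return bytes(out)

# Grouped: treat 8-byte slices as integers and apply the operation once
# per slice, cutting the iteration count by a factor of 8.
def encrypt_grouped(data, key, group=8):
    wide_key = int.from_bytes(bytes([key]) * group, "big")
    out = bytearray()
    for i in range(0, len(data), group):
        chunk = data[i:i + group]
        k = wide_key >> (8 * (group - len(chunk)))  # trim key for a short tail
        n = int.from_bytes(chunk, "big") ^ k
        out += n.to_bytes(len(chunk), "big")
    return bytes(out)

for fn in (encrypt_per_char, encrypt_grouped):
    start = time.perf_counter()
    fn(data, key)
    print("{}: {:.3f}s".format(fn.__name__, time.perf_counter() - start))

Both functions produce identical output; the grouped version just does roughly an eighth of the loop iterations.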

Upvotes: 0
