user2065276

Reputation: 311

How to read a hex file into numpy array

I want to load the following hexadecimal file, which has

1) the initialization vector (IV) on the first line,
2) the encryption key on the second line,
3) the number of plaintexts on the third line, and
4) the actual plaintexts for AES encryption in Cipher Block Chaining (CBC) mode

into a numpy array.

6bce1cb8d64153f82570751b6653c943
b15a65475a91774a45106fbc28f0df70
10
f493befb2dcad5118d523a4a4bf4a504
54fc4e0a82ae8dc56cc7befc9994b79d
878d287647b457fd95d40691b6e0c8ab
dc0adc16665eb96a15d3257752ae67dc
8cda3b8f23d38e9240b9a89587f69970
e06301763146c1bac24619e61015f481
c19def2f12e5707d89539e18ad104937
048d734a1a36d4346edc7ceda07ff171
5e621ce0a570478c1c2ec3e557ca3e0d
e55c57b119ff922b7f87db0ead2006cd

If the uniformity of the file bothers you, you may ignore the third line, which gives the number of plaintexts to be encrypted. All lines except the third are 128-bit hexadecimal entries.

The idea is to load this file into a numpy array and then do AES encryption efficiently.

How can I load this into a numpy array and then use AES from Crypto.Cipher to encrypt this file and similar files? I have files in this format with as many as 100 million plaintexts.

Thanks, and please let me know if you have any questions.

Upvotes: 2

Views: 4151

Answers (1)

abarnert

Reputation: 365925

I'm assuming you want to unhexlify the data and store the resulting bytestrings as fixed-length character strings rather than as object dtype. (You can't store them as some kind of int128 type, because numpy doesn't have such a type.)

To avoid reading 3.2GB of text into memory, and using roughly the same amount pre-processing it into the desired form, you probably want to use fromiter, so:

import binascii
import numpy as np

with open(myfile) as f:
    iv = binascii.unhexlify(f.readline().strip())    # line 1: the IV
    key = binascii.unhexlify(f.readline().strip())   # line 2: the key
    count = int(f.readline())                        # line 3: number of plaintexts
    # remaining lines: one 16-byte block each, stored as fixed-length strings
    a = np.fromiter((binascii.unhexlify(line.strip()) for line in f), dtype='|S16')

If you have 10GB of RAM to spare (rough ballpark guess), it might be faster to read the whole thing in as an array of object, then transform it twice… but I doubt it.


As to whether this will help… You might get a little benefit, because AES-ing 16 bytes may be fast enough that the cost of iteration is noticeable. Let's test it and see.
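
(The cipher object and the test array aren't shown here; a setup along these lines is assumed, reusing the key, IV, and first plaintext block from the question:)

import binascii
import numpy as np
from Crypto.Cipher import AES

# assumed test setup: key/IV from the question, one example block repeated 100,000 times
key = binascii.unhexlify('b15a65475a91774a45106fbc28f0df70')
iv = binascii.unhexlify('6bce1cb8d64153f82570751b6653c943')
aes = AES.new(key, AES.MODE_CBC, iv)
a = np.array([binascii.unhexlify('f493befb2dcad5118d523a4a4bf4a504')] * 100000, dtype='|S16')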

With 64-bit Mac Python 2.7.2, I created an array of 100000 S16s by copying your example repeatedly. Then:

In [514]: %timeit [aes.encrypt(x) for x in a]
10 loops, best of 3: 166 ms per loop
In [515]: %timeit np.vectorize(aes.encrypt)(a)
10 loops, best of 3: 126 ms per loop

So, that's almost a 25% savings. Not bad.

Of course the array takes longer to build than just keeping things in an iterator in the first place—but even taking that into account, there's still a 9% performance gain. And it may well be reasonable to trade 1.6GB for a 9% speedup in your use case.

Keep in mind that I'm only building an array of 100K objects out of a pre-existing 100K list; with 100M objects read off disk, I/O is probably going to become a serious factor, and it's quite possible that iterative processing (which allows you to interleave the CPU costs with disk waits) will do a whole lot better.
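
That purely iterative version could be as simple as the following (the output filename here is an assumption):

import binascii
from Crypto.Cipher import AES

with open(myfile) as f, open('out.bin', 'wb') as out:   # output name is an assumption
    iv = binascii.unhexlify(f.readline().strip())
    key = binascii.unhexlify(f.readline().strip())
    count = int(f.readline())
    aes = AES.new(key, AES.MODE_CBC, iv)
    for line in f:
        out.write(aes.encrypt(binascii.unhexlify(line.strip())))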

In other words, you need to test with your own real data and scenario. But you already knew that.


For a wider variety of implementations, with a simple perf-testing scaffolding, see this pastebin.

You might want to try combining different approaches. For example, you can use the grouper recipe from itertools to batch things into, say, 32K plaintexts at a time, then process each batch with numpy, to get the best of both. And then pool.imap that numpy processing, to get the best of all 3. Or, alternatively, put the one big numpy array into shared memory, and make each multiprocessing task process a slice of that array.
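
A rough sketch of just the batching part (batch size and output filename are assumptions, and I've used np.frompyfunc in place of np.vectorize because it doesn't make the extra first call that vectorize uses for type inference):

import binascii
import numpy as np
from itertools import izip_longest            # zip_longest on Python 3
from Crypto.Cipher import AES

def grouper(iterable, n, fillvalue=None):
    # the itertools "grouper" recipe: collect data into fixed-length chunks
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

with open(myfile) as f, open('out.bin', 'wb') as out:   # output name is an assumption
    iv = binascii.unhexlify(f.readline().strip())
    key = binascii.unhexlify(f.readline().strip())
    count = int(f.readline())
    aes = AES.new(key, AES.MODE_CBC, iv)
    encrypt_block = np.frompyfunc(aes.encrypt, 1, 1)     # elementwise, object-dtype output
    blocks = (binascii.unhexlify(line.strip()) for line in f)
    for batch in grouper(blocks, 32768):
        chunk = np.array([b for b in batch if b is not None], dtype='|S16')
        out.write(''.join(encrypt_block(chunk)))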

Upvotes: 4
