Reputation: 41
I am unpacking large binary files (~1 GB) containing many different datatypes. I am in the early stages of writing the loop to convert each record. I have been using struct.unpack, but recently thought it would run faster if I used numpy. However, switching to numpy has slowed down my program. I have tried:
struct.unpack
np.fromfile
np.frombuffer
np.ndarray
Note: in the np.fromfile method I leave the file open rather than loading it into memory, and seek through it.
1)
import struct

data = {}
with open("file_loc", mode='rb') as file:
    RAW = file.read()

byte = 0
length = len(RAW)                    # don't shadow the len() builtin
while byte < length:
    header = struct.unpack(">HHIH", RAW[byte:byte + 10])
    size = header[1]
    loc = str(header[3])
    # the one-byte format ">B" can't unpack a multi-byte slice; size the format
    data[loc] = struct.unpack(f">{size - 20}B", RAW[byte + 10:byte + size - 10])
    byte += size
2)
dt = np.dtype('>u2,>u2,>u4,>u2')
with open("file_loc", mode='rb') as RAW:
    # same loop as above, but seeking through the open file;
    # np.fromfile reads from the file's current position -- it takes a
    # file object, not a slice
    RAW.seek(byte)
    header = np.fromfile(RAW, dtype=dt, count=1)[0]
    data = np.fromfile(RAW, dtype=">u1", count=size - 10)
3)
dt = np.dtype('>u2,>u2,>u4,>u2')
with open("file_loc", mode='rb') as file:
    RAW = file.read()
# same loop:
header = np.ndarray(shape=(1,), dtype=dt, buffer=RAW[byte:byte + 10])[0]
data = np.ndarray(shape=(size - 10,), dtype=">u1", buffer=RAW[byte + 10:byte + size])
4) pretty much the same as 3 except using np.frombuffer()
All of the numpy implementations run at about half the speed of the struct.unpack method, which is not what I expected.
Let me know if there is anything I can do to improve performance.
Also, I just typed this from memory, so it might have some errors.
Upvotes: 4
Views: 4057
Reputation: 231530
I haven't used struct much, but between your code and the docs I got it to work on a buffer that stores an array of integers.
Make a byte string from a numpy array:
In [81]: arr = np.arange(1000)
In [82]: barr = arr.tobytes()
In [83]: type(barr)
Out[83]: bytes
In [84]: len(barr)
Out[84]: 8000
The reverse is frombuffer:
In [85]: x = np.frombuffer(barr, dtype=int)
In [86]: x[:10]
Out[86]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [87]: np.allclose(x,arr)
Out[87]: True
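One thing to watch when adapting this round-trip to a big-endian file like yours: the dtype passed to frombuffer must match the byte order of the buffer, or the values come back byte-swapped on a little-endian machine. A small sketch (not from the session above):

```python
import numpy as np

arr = np.arange(5, dtype='>u2')           # big-endian uint16, like the file format
barr = arr.tobytes()

good = np.frombuffer(barr, dtype='>u2')   # byte order matches the buffer
bad = np.frombuffer(barr, dtype='<u2')    # mismatched order byte-swaps the values

# good -> [0, 1, 2, 3, 4]; bad -> [0, 256, 512, 768, 1024]
```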
ndarray also works, though direct use of this constructor is usually discouraged:
In [88]: x = np.ndarray(buffer=barr, dtype=int, shape=(1000,))
In [89]: np.allclose(x,arr)
Out[89]: True
To use struct I need a format string that includes the length, here 1000 longs ('1000l'):
In [90]: tup = struct.unpack('1000l', barr)
In [91]: len(tup)
Out[91]: 1000
In [92]: tup[:10]
Out[92]: (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
In [93]: np.allclose(np.array(tup),arr)
Out[93]: True
So now that we've established equivalent methods of reading the buffer, do some timings:
In [94]: timeit x = np.frombuffer(barr, dtype=int)
617 ns ± 0.806 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [95]: timeit x = np.ndarray(buffer=barr, dtype=int, shape=(1000,))
1.11 µs ± 1.76 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [96]: timeit tup = struct.unpack('1000l', barr)
19 µs ± 38.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [97]: timeit tup = np.array(struct.unpack('1000l', barr))
87.5 µs ± 25.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
frombuffer looks pretty good.
Your struct.unpack loop confuses me. I don't think it's doing the same thing as the frombuffer call. But as I said at the start, I haven't used struct much.
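For what it's worth, here is how frombuffer might be applied to your header-plus-payload loop, using a structured dtype and the offset parameter so no per-record slices of RAW are made. This is only a sketch of my reading of your format (10-byte >HHIH header, total record size in the second field, payload assumed to run to the end of the record), not a drop-in replacement:

```python
import struct
import numpy as np

# structured dtype matching the ">HHIH" header (10 bytes, big-endian)
header_dt = np.dtype([('f0', '>u2'), ('size', '>u2'), ('f2', '>u4'), ('loc', '>u2')])

def parse(raw):
    """Walk the buffer record by record without slicing copies of raw."""
    data = {}
    byte = 0
    while byte < len(raw):
        header = np.frombuffer(raw, dtype=header_dt, count=1, offset=byte)[0]
        size = int(header['size'])        # total record size, as in your loop
        data[str(header['loc'])] = np.frombuffer(
            raw, dtype='>u1', count=size - 10, offset=byte + 10)
        byte += size
    return data

# two hypothetical records built with the same header format
raw = (struct.pack('>HHIH', 1, 13, 0, 7) + bytes([1, 2, 3])
       + struct.pack('>HHIH', 2, 12, 0, 9) + bytes([4, 5]))
result = parse(raw)
```

Each frombuffer call returns a view into raw rather than a copy, so the per-record cost is just the call overhead; whether that beats struct.unpack on 10-byte headers is something you would have to time on your data.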
Upvotes: 2