Mathieu Gauquelin

Reputation: 635

Convert data faster (from byte to 3D numpy array)

I have to read a binary file which contains 1300 images of 320*256 pixels stored as uint16, and convert it to a numpy array. The raw data that struct.unpack has to convert looks like this: b'\xbb\x17\xb4\x17\xe2\x17\xc3\x17\xd3\x17'. The file is laid out as follows:

Main header / Frame header1 / Frame1 / Frame header2  / Frame2 / etc.

Sorry I can't give you the file.
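For reference, a minimal check (assuming the pixels really are little-endian 16-bit values, as the code below treats them) of how such a byte string decodes:

import struct

sample = b'\xbb\x17\xb4\x17\xe2\x17\xc3\x17\xd3\x17'
values = struct.unpack('<5H', sample)   # 10 bytes -> 5 little-endian unsigned shorts
print(values)                           # (6075, 6068, 6114, 6083, 6099)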

EDIT: new version of the code (3 GB used during processing, 1.5 GB in RAM at the end) -- Thanks to Paul

import struct, numpy as np, matplotlib.pyplot as plt
filename = 'blabla'
with open(filename, mode="rb") as f:
    # Initialize variables
    width = 320
    height = 256
    frame_nb_octet = width * height * 2   # bytes per frame (2 bytes per uint16 pixel)
    count_frame = 1300
    fmt = "<" + "H" * width * height      # little endian, unsigned short
    main_header_size = 4000
    frame_header_size = 100

    # Read the whole file at once
    data = f.read()

    # -------------- BEFORE --------------
    # # Convert bytes into int (be careful to skip main/frame headers)
    # tab = []
    # for indice in range(count_frame):
    #     ind_start = main_header_size + indice * (frame_header_size + frame_nb_octet) + frame_header_size
    #     ind_end = ind_start + frame_nb_octet
    #     tab.append(struct.unpack(fmt, data[ind_start:ind_end]))
    # images = np.array(tab).reshape(count_frame, height, width)
    # ------------------------------------

    # Convert bytes into float (for the later mean, etc.), skipping main/frame headers
    dt = np.dtype(np.uint16).newbyteorder('<')
    array = np.empty((width * height, count_frame), dtype=float)
    for indice in range(count_frame):
        offset = main_header_size + indice * (frame_header_size + frame_nb_octet) + frame_header_size
        array[:, indice] = np.frombuffer(data, dtype=dt, count=width * height, offset=offset)
    array = array.reshape(height, width, count_frame)

    # Plot the first image to check the data
    fig = plt.figure()
    # plt.imshow(images[0, :, :])
    plt.imshow(array[:, :, 0])
    plt.show()

Performance: is there another way to convert my data faster, or one that uses less RAM?

Thank you in advance for your help/advice.

Upvotes: 1

Views: 1204

Answers (1)

user7138814

Reputation: 2041

Try a memory map:

dtype = [('headers', np.void, frame_header_size),   # one opaque blob per frame header
         ('frames', '<u2', (height, width))]        # the pixels: little-endian uint16
mmap = np.memmap(filename, dtype, mode='r', offset=main_header_size)  # skip the main header
array = mmap['frames']

You can convert it to floating point with .astype if needed.
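For example (a small sketch; array is the mapped 'frames' field from the snippet above, and the axis-0 mean is just one illustration):

frames = array.astype(np.float64)   # copies the mapped uint16 frames into a float array in RAM
mean_image = frames.mean(axis=0)    # per-pixel mean over all frames, shape (height, width)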


Actually, to be less cryptic, the clever thing here is using a "structured array", not so much the memory map. You can read about structured arrays in the numpy docs. The trick then becomes choosing a dtype that exactly matches the format of the data.
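For instance, one quick sanity check (a sketch using the sizes from the question) is that the dtype's itemsize equals the size of one frame-header-plus-frame record:

import numpy as np

frame_header_size, height, width = 100, 256, 320
dt = np.dtype([('headers', np.void, frame_header_size), ('frames', '<u2', (height, width))])
print(dt.itemsize)   # 163940 == frame_header_size + 2 * height * width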

We can skip the main header by choosing an offset for the memory map. As an alternative we could have done it like this:

fh = open(filename, 'rb')
fh.seek(main_header_size)      # skip the main header by hand
data = np.fromfile(fh, dtype)  # dtype is the structured dtype defined above
fh.close()

That leaves the frame data and frame headers. Luckily every frame and frame header has the same size, so we can describe them with a structured dtype. We're not really interested in the frame headers so we give them a void dtype of the specified size. For the data itself we have height * width values, for which we use a convenient sub-array format. We use typestring <u2 to specify "little-endian unsigned short", see numpy docs on data types. Now numpy has all info it needs to read the data in exactly the right format.
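Either way, the result is one record per frame, and indexing the 'frames' field gives the whole image stack at once (a sketch, assuming the 1300 frames from the question):

frames = data['frames']    # shape (1300, 256, 320), dtype uint16
first_image = frames[0]    # first frame as a (height, width) array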

Basically, with a structured dtype you can describe the data layout of a numpy array in fine detail. And then, with np.memmap or np.fromfile, you can load data in this format from disk.
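To make that concrete, here is a self-contained sketch (not the original file: it writes a tiny test file with the same main-header / frame-header / frame layout, using scaled-down sizes and a made-up name test.bin, then reads it back):

import numpy as np

main_header_size, frame_header_size = 4000, 100
height, width, count_frame = 4, 6, 3

# Write a small test file: main header, then (frame header, frame) repeated
original = (np.arange(count_frame * height * width) % 500).astype('<u2').reshape(count_frame, height, width)
with open('test.bin', 'wb') as fh:
    fh.write(b'\x00' * main_header_size)
    for frame in original:
        fh.write(b'\x00' * frame_header_size)
        fh.write(frame.tobytes())

# Read it back through the structured dtype and a memory map
dtype = [('headers', np.void, frame_header_size), ('frames', '<u2', (height, width))]
mmap = np.memmap('test.bin', dtype, mode='r', offset=main_header_size)
assert np.array_equal(mmap['frames'], original)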

Upvotes: 1
