Greg

Reputation: 251

Read a large big-endian binary file

I have a very large big-endian binary file, and I know how many numbers it contains. I found a solution for reading a big-endian file using struct, and it works perfectly if the file is small:

    import struct

    data = []
    file = open('some_file.dat', 'rb')

    for i in range(0, numcount):
        data.append(struct.unpack('>f', file.read(4))[0])

But this code becomes very slow once the file is larger than ~100 MB. My current file is 1.5 GB and contains 399,513,600 float numbers; the above code takes about 8 minutes with it.

I found another solution, that works faster:

    datafile = open('some_file.dat', 'rb').read()
    f_len = ">" + "f" * numcount   #numcount = 399513600

    numbers = struct.unpack(f_len, datafile)

This code runs in about 1.5 minutes, but that is still too slow for me. I earlier wrote the same functionality in Fortran, and it ran in about 10 seconds.

In Fortran I open the file with a "big-endian" flag and can simply read the file into a REAL array without any conversion, but in Python I have to read the file as bytes and convert every 4 bytes into a float using struct. Is it possible to make the program run faster?

Upvotes: 7

Views: 13199

Answers (3)

Pedro Nunes

Reputation: 1

    def read_big_endian(filename):
        chars = []
        with open(filename, "rb") as template:
            template.read(2)  # skip the BOM: first 2 bytes are FF FE
            while True:
                dchar = template.read(2)
                if len(dchar) < 2:  # end of file
                    break
                chars.append(chr(dchar[0]))  # low byte holds the character
        return "".join(chars)


    def save_big_endian(filename, text):
        with open(filename, "wb") as fic:
            fic.write(b"\xff\xfe")  # first 2 bytes are FF FE
            for letter in text:
                fic.write(letter.encode("latin-1") + b"\x00")

Used to read .rdp files
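
For comparison, a minimal standard-library sketch that does the same job, assuming the files really are UTF-16-LE text with a BOM (which the FF FE marker suggests); the names read_utf16 and save_utf16 are just illustrative:

    def read_utf16(filename):
        # The built-in 'utf-16' codec detects and consumes the BOM itself.
        with open(filename, encoding="utf-16") as f:
            return f.read()

    def save_utf16(filename, text):
        # Write the FF FE BOM explicitly, then the little-endian payload.
        with open(filename, "w", encoding="utf-16-le") as f:
            f.write("\ufeff" + text)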

Upvotes: 0

Martin Evans

Reputation: 46779

The following approach gave a good speed up for me:

    import struct
    import time


    block_size = 4096
    start = time.time()

    with open('some_file.dat', 'rb') as f_input:
        data = []

        while True:
            block = f_input.read(block_size * 4)
            # Unpack however many whole floats this block holds.
            data.extend(struct.unpack('>{}f'.format(len(block) // 4), block))

            if len(block) < block_size * 4:
                break

    print("Time taken: {:.2f}".format(time.time() - start))
    print("Length", len(data))

Rather than using >fffffff you can specify a count, e.g. >1000f. The script reads the file 4096 floats (16,384 bytes) at a time; when a read returns fewer bytes than a full block, it unpacks whatever remains and exits the loop.

From the struct - Format Characters documentation:

A format character may be preceded by an integral repeat count. For example, the format string '4h' means exactly the same as 'hhhh'.
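
A quick sanity check of that equivalence (a minimal illustration, not from the documentation itself):

    import struct

    buf = struct.pack('>4h', 1, 2, 3, 4)
    # '>4h' and '>hhhh' describe the same layout.
    assert struct.unpack('>4h', buf) == struct.unpack('>hhhh', buf)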

Upvotes: 1

Bakuriu

Reputation: 101999

You can use numpy.fromfile to read the file, specifying that the type is big-endian with a > in the dtype parameter:

    numpy.fromfile(filename, dtype='>f')
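
For the file in the question, a fuller sketch might look like this (the filename comes from the question; converting to native byte order afterwards is an optional extra I am assuming is wanted):

    import numpy as np

    # '>f4' is an explicit big-endian 32-bit float; one call reads the file.
    data = np.fromfile('some_file.dat', dtype='>f4')

    # Optional: switch to native byte order so later arithmetic stays fast.
    data = data.astype(np.float32)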

There is an array.fromfile method too, but unfortunately I cannot see any way to control endianness with it, so depending on your use case it might let you avoid the dependency on a third-party library, or it might be useless.

Upvotes: 7
