erickrf

Reputation: 2104

Fastest way to read a list of numbers from a file

I have found a few similar questions here on Stack Overflow, but I believe I could benefit from advice specific to my case.

I must store around 80 thousand lists of real-valued numbers in a file and read them back later.

First, I tried cPickle, but the reading time wasn't appealing:

>>> stmt = """
with open('pickled-data.dat') as f:
    data = cPickle.load(f)
"""
>>> timeit.timeit(stmt, 'import cPickle', number=1)
3.8195440769195557

Then I found out that storing the numbers as plain text allows faster reading (makes sense, since cPickle must worry about a lot of things):

>>> stmt = """
data = []
with open('text-data.dat') as f:
    for line in f:
        data.append([float(x) for x in line.split()])
"""
>>> timeit.timeit(stmt, number=1)
1.712096929550171

This is a good improvement, but I think I could still optimize it somehow, since programs written in other languages can read similar data from files considerably faster.

Any ideas?

Upvotes: 1

Views: 831

Answers (1)

mgilson

Reputation: 310069

If numpy arrays are workable, numpy.fromfile will likely be the fastest option for reading the files (here's a somewhat related question I asked just a couple of days ago).
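A minimal sketch of the numpy route, assuming the lists may have different lengths and that float64 values with int32 counts are acceptable. The file layout (record count, then all lengths, then all values) and the helper names `save_lists` and `load_lists` are illustrative choices, not something numpy prescribes:

```python
import os
import tempfile

import numpy as np


def save_lists(path, lists):
    """Write the record count, the per-list lengths, then all values as raw binary."""
    lengths = np.array([len(lst) for lst in lists], dtype=np.int32)
    values = np.array([x for lst in lists for x in lst], dtype=np.float64)
    with open(path, 'wb') as f:
        np.array([len(lists)], dtype=np.int32).tofile(f)
        lengths.tofile(f)
        values.tofile(f)


def load_lists(path):
    """Read the file back and return one numpy array per original list."""
    with open(path, 'rb') as f:
        nrec = int(np.fromfile(f, dtype=np.int32, count=1)[0])
        if nrec == 0:
            return []
        lengths = np.fromfile(f, dtype=np.int32, count=nrec)
        values = np.fromfile(f, dtype=np.float64, count=int(lengths.sum()))
    # Cut the flat value array at the cumulative offsets of each list.
    return np.split(values, np.cumsum(lengths)[:-1])


# Round trip through a temporary file (hypothetical sample data).
sample = [[1.0, 2.5], [3.0], [4.0, 5.0, 6.0]]
path = os.path.join(tempfile.mkdtemp(), 'numpy-data.dat')
save_lists(path, sample)
loaded = load_lists(path)
```

The values are read with a single fromfile call, so the per-number float() parsing in Python disappears. The raw bytes use the machine's native byte order, so the file is not portable across architectures with different endianness.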

Alternatively, it seems like you could do a little better with struct, though I haven't tested it:

import struct

# Open the file in binary mode ('wb' to write, 'rb' to read).
def write_data(f, data):
    # Header: the number of records.
    f.write(struct.pack('i', len(data)))
    for lst in data:
        # Each record: its element count, then the elements as 4-byte floats.
        f.write(struct.pack('i%df' % len(lst), len(lst), *lst))

def read_data(f):
    def read_record(f):
        nelem = struct.unpack('i', f.read(4))[0]
        # If tuples are OK, remove the `list` call.
        return list(struct.unpack('%df' % nelem, f.read(nelem * 4)))

    nrec = struct.unpack('i', f.read(4))[0]
    return [read_record(f) for i in range(nrec)]

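This assumes that storing the data as 4-byte floats is good enough. If you want real double-precision numbers, change the format codes from f to d and change nelem*4 to nelem*8. Note that with d, struct's default native alignment inserts padding between the int count and the doubles on most platforms, so either pack the count in a separate call or prefix the formats with '<' or '=' to disable alignment. There may also be some minor portability issues (endianness and the sizes of the data types, for example).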
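As a sketch of the double-precision variant, the version below prefixes every format with '<', which selects little-endian byte order, standard sizes (4-byte int, 8-byte double) and no alignment padding. Choosing little-endian is an assumption made for illustration; '>' would work equally well as long as the writer and reader agree. The in-memory buffer stands in for a file opened in binary mode:

```python
import io
import struct


def write_data(f, data):
    # Header: number of records as a little-endian 4-byte int.
    f.write(struct.pack('<i', len(data)))
    for lst in data:
        # Record: element count, then the elements as 8-byte doubles.
        f.write(struct.pack('<i%dd' % len(lst), len(lst), *lst))


def read_data(f):
    (nrec,) = struct.unpack('<i', f.read(4))
    records = []
    for _ in range(nrec):
        (nelem,) = struct.unpack('<i', f.read(4))
        records.append(list(struct.unpack('<%dd' % nelem, f.read(nelem * 8))))
    return records


# Round trip through an in-memory buffer (hypothetical sample data).
buf = io.BytesIO()
write_data(buf, [[1.0, 2.5], [], [3.25]])
buf.seek(0)
restored = read_data(buf)
```

Because '<' disables alignment, each record occupies exactly 4 + 8 * nelem bytes, which is what read_data relies on when it reads the count and the values in two separate calls.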

Upvotes: 2
