Reputation: 2104
I have found a few similar questions here in Stack Overflow, but I believe I could benefit from advice specific for my case.
I must store around 80 thousand lists of real-valued numbers in a file and read them back later.
First, I tried cPickle, but the reading time wasn't appealing:
>>> stmt = """
with open('pickled-data.dat') as f:
data = cPickle.load(f)
"""
>>> timeit.timeit(stmt, 'import cPickle', number=1)
3.8195440769195557
Then I found out that storing the numbers as plain text allows faster reading (makes sense, since cPickle must worry about a lot of things):
>>> stmt = """
data = []
with open('text-data.dat') as f:
for line in f:
data.append([float(x) for x in line.split()])
"""
>>> timeit.timeit(stmt, number=1)
1.712096929550171
This is a good improvement, but I think I could still optimize it somehow, since programs written in other languages can read similar data from files considerably faster.
Any ideas?
Upvotes: 1
Views: 831
Reputation: 310069
If numpy arrays are workable, numpy.fromfile will likely be the fastest option to read the files (here's a somewhat related question I asked just a couple days ago).
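For example, a minimal sketch of that approach (untested; the file layout -- a record count, then all lengths, then all values as 4-byte floats -- and the function names are assumptions here, not part of the original answer):

import numpy as np

def write_data_np(fname, data):
    # header: record count, then the length of every list (int32),
    # then all values concatenated into one flat float32 block
    lengths = np.array([len(lst) for lst in data], dtype=np.int32)
    values = np.concatenate([np.asarray(lst, dtype=np.float32) for lst in data])
    with open(fname, 'wb') as f:
        np.array([len(data)], dtype=np.int32).tofile(f)
        lengths.tofile(f)
        values.tofile(f)

def read_data_np(fname):
    with open(fname, 'rb') as f:
        nrec = np.fromfile(f, dtype=np.int32, count=1)[0]
        lengths = np.fromfile(f, dtype=np.int32, count=nrec)
        # one fromfile call reads the bulk of the data
        values = np.fromfile(f, dtype=np.float32, count=lengths.sum())
    # split the flat block back into one array per original list
    # (call .tolist() on each piece if plain Python lists are needed)
    return np.split(values, np.cumsum(lengths)[:-1])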
Alternatively, it seems like you could do a little better with struct, though I haven't tested it:
import struct

def write_data(f, data):
    # header: total number of records
    f.write(struct.pack('i', len(data)))
    for lst in data:
        # each record: its length, then its values as 4-byte floats
        f.write(struct.pack('i%df' % len(lst), len(lst), *lst))

def read_data(f):
    def read_record(f):
        nelem = struct.unpack('i', f.read(4))[0]
        return list(struct.unpack('%df' % nelem, f.read(nelem*4)))  # if tuples are OK, remove the `list`
    nrec = struct.unpack('i', f.read(4))[0]
    return [read_record(f) for i in range(nrec)]
This assumes that storing the data as 4-byte floats is good enough. If you want real double-precision numbers, change the format characters from f to d and change nelem*4 to nelem*8. There might be some minor portability issues here (endianness and the size of the data types, for example).
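For completeness, a quick usage sketch (the file name is just an example; the files must be opened in binary mode):

data = [[1.0, 2.5, 3.25], [0.5], [4.0, 5.0]]

with open('struct-data.dat', 'wb') as f:
    write_data(f, data)

with open('struct-data.dat', 'rb') as f:
    data_back = read_data(f)
    # values come back as 4-byte floats, so expect small rounding differences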
Upvotes: 2