Reputation: 1007
I have a data file that contains the following:
somename = [ [1,2,3,4,5,6],...plus other such elements making a 60 MB file called somefile.py ]
In my Python script (in the same folder as the data) I have only this, along with the appropriate shebang:
from somefile import somename
This took almost 20 minutes to complete. How can such an import be improved?
I'm using Python 3.7 on macOS 10.13.
Upvotes: 0
Views: 806
Reputation: 16184
Loading files as "Python source code" will always be relatively slow, but 20 minutes to load a 60 MiB file seems far too slow. Python uses a full lexer/parser and, amongst other things, tracks source locations for accurate error reporting. Its grammar is deliberately simple, which makes parsing relatively fast, but it's still going to be much slower than other file formats.
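To see how much of that cost is just the parser, you can isolate the compile step; this is my own illustration on synthetic data, not a measurement of the question's actual file:
import time

# build a Python source string of roughly a couple of megabytes
source = 'somename = ' + str([list(range(6)) for _ in range(100_000)])

start = time.perf_counter()
code = compile(source, '<data>', 'exec')
print(f'compile took {(time.perf_counter() - start) * 1000:.1f} ms')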
I'd go with one of the other suggestions, but I thought it would be interesting to compare timings across the different file formats.
First I generate some data:
somename = [list(range(6)) for _ in range(100_000)]
This takes my computer 152 ms. I can then save it as a "Python source file" with:
with open('data.py', 'w') as fd:
    fd.write(f'somename = {somename}')
which takes 84.1 ms. Reloading it using:
from data import somename
takes 1.40 seconds. I tried some other sizes and the scaling seems linear in the array length, which I find impressive. I then played with different file formats, starting with JSON:
import json

with open('data.json', 'w') as fd:
    json.dump(somename, fd)

with open('data.json') as fd:
    somename = json.load(fd)
Here saving took 787 ms and loading took 131 ms. Next, CSV:
import csv

# the csv docs recommend opening with newline='' in both directions
with open('data.csv', 'w', newline='') as fd:
    out = csv.writer(fd)
    out.writerows(somename)

with open('data.csv', newline='') as fd:
    inp = csv.reader(fd)
    somename = [[int(v) for v in row] for row in inp]
Saving took 114 ms while loading took 329 ms (down to 129 ms if the strings aren't converted to ints). Next I tried musbur's suggestion of pickle:
import pickle  # no need for `cPickle` in Python 3

with open('data.pck', 'wb') as fd:
    pickle.dump(somename, fd)

with open('data.pck', 'rb') as fd:
    somename = pickle.load(fd)
Saving took 49.1 ms and loading took 128 ms.
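One tweak that might shave a bit more off (my addition, not part of the timings above) is to pass an explicit, newer pickle protocol, which is generally faster and more compact for large objects:
import pickle

# on Python 3.7 the default protocol is 3; HIGHEST_PROTOCOL is 4
with open('data.pck', 'wb') as fd:
    pickle.dump(somename, fd, protocol=pickle.HIGHEST_PROTOCOL)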
The take-home message seems to be that loading data from Python source code takes about 10 times as long as the other formats, but I'm not sure why it's taking your computer 20 minutes!
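For reference, here is a minimal sketch of how timings like these could be reproduced; this harness using time.perf_counter is my own, and it assumes the data.py and data.json files created above:
import importlib
import json
import time

def timed(label, fn):
    # run fn once and report the elapsed wall-clock time
    start = time.perf_counter()
    result = fn()
    print(f'{label}: {(time.perf_counter() - start) * 1000:.1f} ms')
    return result

# an import can only be timed once per process: Python caches modules
# in sys.modules, so repeated imports cost nothing
data = timed('import data.py', lambda: importlib.import_module('data'))
somename = data.somename

def load_json():
    with open('data.json') as fd:
        return json.load(fd)

somename = timed('json.load', load_json)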
Upvotes: 3
Reputation: 679
The somefile.py file is obviously created by some piece of software. If it is re-created regularly (i.e., it changes often), that software should be changed to emit the data in a format that is more easily loaded from Python (such as tabular text, JSON, YAML, ...). If it is static data that never changes, do this:
import pickle  # cPickle was folded into pickle in Python 3

from somefile import somename

with open("data.pck", "wb") as fh:
    pickle.dump(somename, fh)
This will serialize your data into a file "data.pck", from which it can be reloaded very quickly.
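Reloading it later would then look something like this (a minimal sketch, matching the "data.pck" file name above):
import pickle

# read the pickled list back; this avoids re-parsing Python source entirely
with open("data.pck", "rb") as fh:
    somename = pickle.load(fh)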
Upvotes: 0