Geoff

Reputation: 1007

Python: improve import time of data

I have a data file that contains the following:

somename = [[1, 2, 3, 4, 5, 6], ...]  # plus other such elements, making a 60 MB file called somefile.py

In my Python script (in the same folder as the data file) I have only the following, along with an appropriate shebang:

from somefile import somename

This took almost 20 minutes to complete. How can such an import be improved?

I'm using Python 3.7 on macOS 10.13.

Upvotes: 0

Views: 806

Answers (2)

Sam Mason

Reputation: 16184

loading files as "Python source code" will always be relatively slow, but 20 minutes to load a 60 MiB file seems far too slow. Python uses a full lexer/parser and does things like tracking source locations for accurate error reporting. Its grammar is deliberately simple, which makes parsing relatively fast, but it's still going to be much slower than purpose-built data formats.

I'd go with one of the other suggestions, but I thought it would be interesting to compare timings across different file formats.

first I generate some data:

somename = [list(range(6)) for _ in range(100_000)]

this takes my computer 152 ms to do. I can then save it in a "Python source file" with:

with open('data.py', 'w') as fd:
    fd.write(f'somename = {somename}')

which takes 84.1 ms. Reloading this using:

from data import somename

which takes 1.40 seconds. I tried some other sizes and the scaling seems linear in the array length, which I find impressive.
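Note that timing an import honestly takes a little care, since Python caches both the imported module and its compiled bytecode; a minimal sketch of a harness that defeats both caches (assuming the data.py generated above sits next to the script):

import importlib
import shutil
import sys
import time

sys.dont_write_bytecode = True                    # don't cache fresh bytecode
shutil.rmtree('__pycache__', ignore_errors=True)  # drop stale compiled .pyc files
sys.modules.pop('data', None)                     # forget any already-imported copy
importlib.invalidate_caches()

start = time.perf_counter()
from data import somename                         # forces a full parse of data.py
print(f'import took {time.perf_counter() - start:.2f} s')

I then started to play with different file formats, first JSON: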

import json

with open('data.json', 'w') as fd:
    json.dump(somename, fd)

with open('data.json') as fd:
    somename = json.load(fd)

here saving took 787 ms and loading took 131 ms. Next, CSV:

import csv

with open('data.csv', 'w') as fd:
    out = csv.writer(fd)
    out.writerows(somename)

with open('data.csv') as fd:
    inp = csv.reader(fd)
    somename = [[int(v) for v in row] for row in inp]

saving took 114 ms while loading took 329 ms (down to 129 ms if the strings aren't converted to ints). Next I tried musbur's suggestion of pickle:

import pickle  # no need for `cPickle` in Python 3

with open('data.pck', 'wb') as fd:
    pickle.dump(somename, fd)

with open('data.pck', 'rb') as fd:
    somename = pickle.load(fd)

the saving took 49.1 ms and loading took 128 ms.
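Putting the numbers side by side:

format           save      load
Python source    84.1 ms   1400 ms
JSON             787 ms     131 ms
CSV              114 ms     329 ms
pickle           49.1 ms    128 ms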

The take-home message seems to be that loading data from Python source code takes about 10 times as long as the other formats, but I'm still not sure how it's taking your computer 20 minutes!

Upvotes: 3

musbur

Reputation: 679

The somefile.py file is obviously created by some other piece of software. If it is re-created regularly (i.e., it changes often), that software should be rewritten to emit the data in a format that is more easily loaded into Python (such as tabular text, JSON, or YAML). If it is static data that never changes, do this:

import pickle  # cPickle is gone in Python 3; plain pickle uses the fast C implementation
from somefile import somename

with open("data.pck", "wb") as fh:
    pickle.dump(somename, fh)

This will serialize your data into a file "data.pck" from which it can be re-loaded very quickly.
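Reloading it later is just as short; a sketch matching the file written above:

import pickle

with open("data.pck", "rb") as fh:
    somename = pickle.load(fh)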

Upvotes: 0
