Prateek Dewan

Reputation: 1631

Fastest way to load a database dump from storage in Python

I have a remote database from which I need to pull around 3 million records, multiple times (I'm experimenting with the data). Since the database is remote, querying it repeatedly takes ages. So I'm thinking of pulling the data once and storing it locally in some format. Local storage space is not a big issue; I can manage a few gigabytes. What would be the fastest way to store this data locally so that retrieval is fast? I'm guessing optimisation is possible at two levels: the data structure and the storage technique.

In terms of data structure, I've been using tuples. I'm guessing a pandas DataFrame would be faster than tuples. Is there an even faster format?
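To make the current setup concrete, here's roughly what I have now versus what I'm considering (the column names and values are just placeholders standing in for my real schema):

    import pandas as pd

    # rows as they come back from the remote query (a list of tuples);
    # placeholder values standing in for the real schema
    rows = [(1, "2015-06-01", 0.52), (2, "2015-06-02", 0.77)]

    # the DataFrame version I'm considering instead of raw tuples
    df = pd.DataFrame.from_records(rows, columns=["id", "timestamp", "value"])
    print(df.head())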

On the storage technique, I've been using pickle, but that's probably not the best way. I've read about HDF5 and cPickle, but I'm not sure which would work best for a database dump, or whether the underlying data structure matters. Are there other alternatives?
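For reference, this is roughly what I'm comparing: plain pickle (what I do now) versus HDF5 through pandas (which, as I understand it, needs PyTables installed). The file names and data are just placeholders:

    import pickle
    import pandas as pd

    df = pd.DataFrame({"id": range(3), "value": [0.1, 0.2, 0.3]})  # placeholder data

    # current approach: pickle the whole object
    with open("dump.pkl", "wb") as f:
        pickle.dump(df, f, protocol=pickle.HIGHEST_PROTOCOL)
    with open("dump.pkl", "rb") as f:
        df_from_pickle = pickle.load(f)

    # alternative I've read about: HDF5 via pandas
    df.to_hdf("dump.h5", key="data", mode="w")
    df_from_hdf = pd.read_hdf("dump.h5", "data")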

Upvotes: 1

Views: 419

Answers (1)

Yuuty

Reputation: 119

I think you'll get the fastest results using sqlite3; in my experience it's the most efficient option, and it handles large sets of data well (in this case, your 3 million records).
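Roughly, what I have in mind is: pull from the remote database once, write the rows into a local SQLite file, and query that file locally from then on. A minimal sketch (the table and column names are only examples, and the sample rows stand in for whatever your remote query returns):

    import sqlite3

    conn = sqlite3.connect("local_cache.db")
    conn.execute("CREATE TABLE IF NOT EXISTS records (id INTEGER, ts TEXT, value REAL)")

    # stand-in for the rows you'd fetch from the remote database once
    rows = [(1, "2015-06-01", 0.52), (2, "2015-06-02", 0.77)]
    conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
    conn.commit()

    # later runs only touch the local file, so retrieval is fast
    for row in conn.execute("SELECT * FROM records WHERE value > ?", (0.6,)):
        print(row)

    conn.close()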

Maybe you've read up on it and know that it can't work with remote databases on its own (it has no network features built in), but if you make the database file available on a shared, network-accessible drive, it will work just fine. You can also accomplish this with Navicat by connecting to a remote SQLite database over SSH.

Upvotes: 1
