Prateek Dewan

Reputation: 1631

Fastest way to load a database dump from storage in Python

I have a remote database from which I need to pull around 3 million records, multiple times (I'm experimenting with the data). Since the database is remote, querying it repeatedly takes ages. So I'm thinking of pulling the data once and storing it locally in some format. Local storage space is not a big issue; I can manage a few gigabytes. What would be the fastest way to store this data locally so that retrieval is fast? I'm guessing optimisation is possible at two levels: the data structure and the storage technique.

In terms of data structure, I've been using tuples. I'm guessing a pandas DataFrame would be faster than tuples. Is there an even faster format?
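To make the current setup concrete, here's roughly what I have now versus what I'm considering (the column names and values are just placeholders standing in for my real schema):

    import pandas as pd

    # rows as they come back from the remote query (a list of tuples);
    # placeholder values standing in for the real schema
    rows = [(1, "2015-06-01", 0.52), (2, "2015-06-02", 0.77)]

    # the DataFrame version I'm considering instead of raw tuples
    df = pd.DataFrame.from_records(rows, columns=["id", "timestamp", "value"])
    print(df.head())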

On the storage technique, I've been using pickle, but that's probably not the best way. I've read about HDF5 and cPickle, but I'm not sure which would work best for a database dump, or whether the underlying data structure matters. Are there other alternatives?
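For reference, this is roughly what I'm comparing: plain pickle (what I do now) versus HDF5 through pandas (which, as I understand it, needs PyTables installed). The file names and data are just placeholders:

    import pickle
    import pandas as pd

    df = pd.DataFrame({"id": range(3), "value": [0.1, 0.2, 0.3]})  # placeholder data

    # current approach: pickle the whole object
    with open("dump.pkl", "wb") as f:
        pickle.dump(df, f, protocol=pickle.HIGHEST_PROTOCOL)
    with open("dump.pkl", "rb") as f:
        df_from_pickle = pickle.load(f)

    # alternative I've read about: HDF5 via pandas
    df.to_hdf("dump.h5", key="data", mode="w")
    df_from_hdf = pd.read_hdf("dump.h5", "data")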

Upvotes: 1

Views: 419

Answers (1)

Yuuty

Reputation: 119

I think you'll get the fastest results using sqlite3; in my experience it's the most efficient option, and it handles large sets of data well (in this case, your 3 million records).
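Roughly, what I have in mind is: pull from the remote database once, write the rows into a local SQLite file, and query that file locally from then on. A minimal sketch (the table and column names are only examples, and the sample rows stand in for whatever your remote query returns):

    import sqlite3

    conn = sqlite3.connect("local_cache.db")
    conn.execute("CREATE TABLE IF NOT EXISTS records (id INTEGER, ts TEXT, value REAL)")

    # stand-in for the rows you'd fetch from the remote database once
    rows = [(1, "2015-06-01", 0.52), (2, "2015-06-02", 0.77)]
    conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
    conn.commit()

    # later runs only touch the local file, so retrieval is fast
    for row in conn.execute("SELECT * FROM records WHERE value > ?", (0.6,)):
        print(row)

    conn.close()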

Maybe you've read up on it and know that it can't work with remote databases on its own (it has no network features built in), but if you make the database file available on a shared, network-accessible drive, it will work just fine. You can also accomplish this with Navicat by connecting to a remote SQLite database over SSH.

Upvotes: 1
