wwwilliam

Reputation: 9612

How do I make large datasets load quickly in Python?

I do data mining research and often have Python scripts that load large datasets from SQLite databases, CSV files, pickle files, etc. In the development process my scripts often need to be changed, and I find myself waiting 20 to 30 seconds for the data to load each time.

Streaming the data (e.g. from a SQLite database) sometimes works, but not in all situations -- if I need to go back into a dataset often, I'd rather pay the upfront time cost of loading the data.

My best solution so far is subsampling the data until I'm happy with my final script. Does anyone have a better solution/design practice?

My "ideal" solution would involve using the Python debugger (pdb) cleverly so that the data remains loaded in memory, I can edit my script, and then resume from a given point.

Upvotes: 3

Views: 1456

Answers (3)

Oppy

Reputation: 2897

A Jupyter notebook allows you to load a large dataset into a memory-resident data structure, such as a pandas DataFrame, in one cell. Then you can operate on that data structure in subsequent cells without having to reload the data.
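A rough sketch of that workflow (the file name and column names here are invented for the example):

# Cell 1: pay the load cost once per kernel session.
import pandas as pd
df = pd.read_csv("large_dataset.csv")

# Cell 2 and later: iterate on the analysis without reloading.
summary = df.groupby("category")["value"].mean()
print(summary.head())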

Upvotes: 0

gbronner

Reputation: 1945

Write a script that does the selects and the object-relational conversions, then pickles the data to a local file. Your development script then starts by unpickling the data and proceeding from there.
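A rough sketch of that split (the database path, query, and file names are placeholders for your own):

# prepare_data.py -- run once, or whenever the source data changes.
import pickle
import sqlite3

conn = sqlite3.connect("research.db")
rows = conn.execute("SELECT id, feature, label FROM samples").fetchall()
conn.close()
with open("dataset.pkl", "wb") as f:
    pickle.dump(rows, f, protocol=pickle.HIGHEST_PROTOCOL)

# dev_script.py -- the script you are actually iterating on.
import pickle

with open("dataset.pkl", "rb") as f:
    rows = pickle.load(f)
# ... your analysis code ...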

If the data is significantly smaller than physical RAM, you can memory-map a file shared between two processes and write the pickled data to memory.
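A rough sketch of that idea using the standard mmap module (the file name and the data are placeholders):

import mmap
import pickle

data = {"rows": list(range(1000000))}  # stand-in for the real dataset
payload = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

# Writer process: size the file, map it, and copy the pickled bytes in.
with open("dataset.mmap", "wb") as f:
    f.truncate(len(payload))
with open("dataset.mmap", "r+b") as f:
    with mmap.mmap(f.fileno(), len(payload)) as mm:
        mm[:] = payload
        mm.flush()

# Reader process (your development script): map it read-only and unpickle.
with open("dataset.mmap", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        data_again = pickle.loads(mm[:])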

Upvotes: 0

dfb

Reputation: 13289

One way to do this would be to split the loading and the manipulation into separate files, Y.py and X.py, and have X.py read

import Y
data = Y.load()
# ... your code ...

When you're working on X.py, you omit these lines from the file and run them manually in an interactive shell instead, so the loaded data stays in memory. Then you can modify X.py and import X in the shell to test your code.
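For completeness, Y.py might be as simple as this (the pandas/CSV load is just a stand-in for whatever slow load you actually do):

# Y.py -- the loader module; the file name is only an example.
import pandas as pd

def load():
    """Do the slow one-time load and return the dataset."""
    return pd.read_csv("large_dataset.csv")

One caveat: after editing X.py, a second plain import X will not pick up your changes, because Python caches imported modules; use reload(X) (or importlib.reload(X) on Python 3) instead.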

Upvotes: 3
