Oliver

Reputation: 3682

How to handle large memory footprint in Python?

I have a scientific application that reads a potentially huge data file from disk and transforms it into various Python data structures such as a map of maps, a list of lists, etc. NumPy is used for the numerical analysis. The problem is that memory usage can grow rapidly, and once the system starts swapping it slows down significantly. The general strategies I have seen are:

  1. lazy initialization: this doesn't seem to help, since many operations require the data in memory anyway.
  2. shelving: the shelve module in the Python standard library seems to support writing data objects into a data file (backed by some db). My understanding is that it dumps the data to a file, but if you need it you still have to load all of it back into memory, so it doesn't exactly help. Please correct me if this is a misunderstanding (see the sketch after this list).
  3. The third option is to leverage a database and offload as much of the data processing to it as possible.
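
For reference, here is a minimal sketch of the shelve usage I have in mind (the file name and the key format are invented for illustration):

    import shelve

    # write: one entry per (x, y) coordinate; shelve keys must be strings
    with shelve.open('events.shelf') as db:
        db['12,34'] = [0.5, 1.7, 2.2]   # event times t observed at coordinate (12, 34)

    # read back later
    with shelve.open('events.shelf') as db:
        times = db['12,34']             # look up a single coordinate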

As an example: a scientific experiment runs for several days and generates a huge (terabytes of data) sequence of:

co-ordinate (x, y), observed event E at time t.

And we need to compute a histogram over t for each (x,y) and output a 3-dimensional array.
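
For concreteness, if the data did fit in memory the computation would be something along these lines (the array sizes and bin counts below are placeholders):

    import numpy as np

    # events: one row per observation, columns are (x, y, t)
    events = np.random.rand(1000000, 3)                # stand-in for the real data

    # 3-dimensional array of counts: one t-histogram per (x, y) bin
    hist, edges = np.histogramdd(events, bins=(100, 100, 50))
    print(hist.shape)                                  # (100, 100, 50)

The real data is orders of magnitude larger than this, which is exactly the problem.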

Any other suggestions? I guess my ideal case would be for the in-memory data structure to be phased out to disk based on a soft memory limit, with the process being as transparent as possible. Can any of these caching frameworks help?

Edit:

I appreciate all the suggested points and directions. Among them, I found user488551's comments to be the most relevant. As much as I like Map/Reduce, for many scientific apps the setup and the effort of parallelizing the code is an even bigger problem to tackle than my original question, IMHO. It is difficult to pick an answer as my question itself is so open ... but Bill's answer is closer to what we can do in the real world, hence the choice. Thank you all.

Upvotes: 3

Views: 357

Answers (2)

Bill Gribble

Reputation: 1797

Well, if you need the whole dataset in RAM, there's not much to do but get more RAM. Sounds like you aren't sure if you really need to, but keeping all the data resident requires the smallest amount of thinking :)

If your data comes in a stream over a long period of time, and all you are doing is creating a histogram, you don't need to keep it all resident. Just create your histogram as you go along, write the raw data out to a file if you want to have it available later, and let Python garbage collect the data as soon as you have bumped your histogram counters. All you have to keep resident is the histogram itself, which should be relatively small.
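
Something along these lines, as a rough sketch (the bin edges, chunk size and the way chunks are read are assumptions about your data):

    import numpy as np

    # fixed bin edges for x, y and t -- adjust to your actual ranges
    x_edges = np.linspace(0, 1, 101)
    y_edges = np.linspace(0, 1, 101)
    t_edges = np.linspace(0, 1, 51)

    # the running histogram is the only thing that stays resident
    hist = np.zeros((100, 100, 50), dtype=np.int64)

    def read_chunks():
        """Placeholder: yield (n, 3) arrays of (x, y, t) rows as they arrive."""
        for _ in range(10):
            yield np.random.rand(100000, 3)

    for chunk in read_chunks():
        counts, _ = np.histogramdd(chunk, bins=(x_edges, y_edges, t_edges))
        hist += counts.astype(np.int64)
        # the chunk is no longer referenced after this point and can be collected

Each chunk is dropped as soon as its counts are added, so memory use stays flat no matter how long the experiment runs.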

Upvotes: 1

Sid

Reputation: 7631

Have you considered divide and conquer? Maybe your problem lends itself to that. One framework you could use for that is Map/Reduce.

Does your problem have multiple phases, such that Phase I takes some data as input and generates output that can be fed to Phase II? In that case you can have one process do Phase I and generate the data for Phase II. Maybe this will reduce the amount of data you need in memory at the same time?

Can you divide your problem into many small problems and recombine the solutions? In that case you can spawn multiple processes that each handle a small sub-problem, and have one or more processes combine the results at the end.
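
A rough sketch of that with the standard library's multiprocessing (the chunking function and the bin setup are placeholders for your actual data):

    import numpy as np
    from multiprocessing import Pool

    BINS = (100, 100, 50)                 # placeholder bin counts for (x, y, t)
    RANGES = [(0, 1), (0, 1), (0, 1)]     # placeholder value ranges

    def histogram_chunk(chunk):
        """Worker: turn one (n, 3) array of (x, y, t) rows into a partial histogram."""
        counts, _ = np.histogramdd(chunk, bins=BINS, range=RANGES)
        return counts

    def chunks():
        """Placeholder: yield manageable pieces of the big data file."""
        for _ in range(8):
            yield np.random.rand(100000, 3)

    if __name__ == '__main__':
        with Pool() as pool:
            partials = pool.map(histogram_chunk, chunks())   # divide
        total = sum(partials)                                # recombine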

If Map-Reduce works for you, look at the Hadoop framework.

Upvotes: 3
