Oliver

Reputation: 3682

How to handle large memory footprint in Python?

I have a scientific application that reads a potentially huge data file from disk and transforms it into various Python data structures such as a map of maps, a list of lists, etc. NumPy is used for the numerical analysis. The problem is that memory usage can grow rapidly, and once the system starts swapping it slows down significantly. The general strategies I have seen are:

  1. lazy initialization: this doesn't seem to help, since many operations require the data in memory anyway.
  2. shelving: the shelve module in the Python standard library seems to support writing data objects into a data file (backed by some db). My understanding is that it dumps the data to a file, but if you need it you still have to load all of it back into memory, so it doesn't exactly help. Please correct me if this is a misunderstanding (see the sketch after this list).
  3. The third option is to leverage a database and offload as much of the data processing to it as possible.
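
For reference, here is a minimal sketch of the shelve usage I have in mind (the file name and the key format are invented for illustration):

    import shelve

    # write: one entry per (x, y) coordinate; shelve keys must be strings
    with shelve.open('events.shelf') as db:
        db['12,34'] = [0.5, 1.7, 2.2]   # event times t observed at coordinate (12, 34)

    # read back later
    with shelve.open('events.shelf') as db:
        times = db['12,34']             # look up a single coordinate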

As an example: a scientific experiment runs for several days and generates a huge (terabytes of data) sequence of:

co-ordinate (x, y), observed event E at time t.

And we need to compute a histogram over t for each (x,y) and output a 3-dimensional array.
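
For concreteness, if the data did fit in memory the computation would be something along these lines (the array sizes and bin counts below are placeholders):

    import numpy as np

    # events: one row per observation, columns are (x, y, t)
    events = np.random.rand(1000000, 3)                # stand-in for the real data

    # 3-dimensional array of counts: one t-histogram per (x, y) bin
    hist, edges = np.histogramdd(events, bins=(100, 100, 50))
    print(hist.shape)                                  # (100, 100, 50)

The real data is orders of magnitude larger than this, which is exactly the problem.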

Any other suggestions? I guess my ideal case would be for the in-memory data structure to be phased out to disk based on a soft memory limit, with the process being as transparent as possible. Can any of these caching frameworks help?

Edit:

I appreciate all the suggested points and directions. Among them, I found user488551's comments to be the most relevant. As much as I like Map/Reduce, for many scientific apps the setup and the effort of parallelizing the code is an even bigger problem to tackle than my original question, IMHO. It is difficult to pick an answer as my question itself is so open ... but Bill's answer is closer to what we can do in the real world, hence the choice. Thank you all.

Upvotes: 3

Views: 357

Answers (2)

Bill Gribble

Reputation: 1797

Well, if you need the whole dataset in RAM, there's not much to do but get more RAM. Sounds like you aren't sure if you really need to, but keeping all the data resident requires the smallest amount of thinking :)

If your data comes in a stream over a long period of time, and all you are doing is creating a histogram, you don't need to keep it all resident. Just create your histogram as you go along, write the raw data out to a file if you want to have it available later, and let Python garbage collect the data as soon as you have bumped your histogram counters. All you have to keep resident is the histogram itself, which should be relatively small.
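
Something along these lines, as a rough sketch (the bin edges, chunk size and the way chunks are read are assumptions about your data):

    import numpy as np

    # fixed bin edges for x, y and t -- adjust to your actual ranges
    x_edges = np.linspace(0, 1, 101)
    y_edges = np.linspace(0, 1, 101)
    t_edges = np.linspace(0, 1, 51)

    # the running histogram is the only thing that stays resident
    hist = np.zeros((100, 100, 50), dtype=np.int64)

    def read_chunks():
        """Placeholder: yield (n, 3) arrays of (x, y, t) rows as they arrive."""
        for _ in range(10):
            yield np.random.rand(100000, 3)

    for chunk in read_chunks():
        counts, _ = np.histogramdd(chunk, bins=(x_edges, y_edges, t_edges))
        hist += counts.astype(np.int64)
        # the chunk is no longer referenced after this point and can be collected

Each chunk is dropped as soon as its counts are added, so memory use stays flat no matter how long the experiment runs.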

Upvotes: 1

Sid

Reputation: 7631

Have you considered divide and conquer? Maybe your problem lends itself to that. One framework you could use for that is Map/Reduce.

Does your problem have multiple phases, such that Phase I takes some data as input and generates output that can be fed to Phase II? In that case you can have one process do Phase I and generate the data for Phase II. Maybe this will reduce the amount of data you need in memory at the same time?

Can you divide your problem into many small problems and recombine the solutions? In that case you can spawn multiple processes that each handle a small sub-problem, and have one or more processes combine the results at the end.
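
A rough sketch of that with the standard library's multiprocessing (the chunking function and the bin setup are placeholders for your actual data):

    import numpy as np
    from multiprocessing import Pool

    BINS = (100, 100, 50)                 # placeholder bin counts for (x, y, t)
    RANGES = [(0, 1), (0, 1), (0, 1)]     # placeholder value ranges

    def histogram_chunk(chunk):
        """Worker: turn one (n, 3) array of (x, y, t) rows into a partial histogram."""
        counts, _ = np.histogramdd(chunk, bins=BINS, range=RANGES)
        return counts

    def chunks():
        """Placeholder: yield manageable pieces of the big data file."""
        for _ in range(8):
            yield np.random.rand(100000, 3)

    if __name__ == '__main__':
        with Pool() as pool:
            partials = pool.map(histogram_chunk, chunks())   # divide
        total = sum(partials)                                # recombine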

If Map-Reduce works for you, look at the Hadoop framework.

Upvotes: 3
