Reputation: 544
So I'm reading in a lot of data from a bunch of different files, and reading it in is one of the major speed hurdles. The problem is that the files live in per-timestep directories, each of which contains a file for every variable at that timestep. So basically I have some functions that look like this:
    import numpy as np
    import pp

    job_server = pp.Server()

    def ReadFiles(path, numPts, timeDirs, variables):
        # one (timesteps x points) array per variable
        data = {}
        for var in variables:
            data[var] = np.zeros((len(timeDirs), numPts))
        for i in range(len(timeDirs)):
            tfile = str(path) + "/" + str(timeDirs[i])
            for j in range(len(variables)):
                # args stands in for the per-file details (tfile, variables[j], ...)
                job_server.submit(ReadData, (args, data, i, j), modules=("numpy",))

    def ReadData(args, data, i, j):
        # update the path for the particular variable
        # read in the data from the file
        # data[variables[j]][i] = that data
TL;DR: I initialize a NumPy array for each variable inside a dict, then update the right slice of it inside another function that is parallelized.
I am using Parallel Python (pp). I would like to move this code to a cluster at some point. The error I get is:
File "/home/steven/anaconda2/lib/python2.7/site-packages/pp.py", line 460, in submit
sargs = pickle.dumps(args, self.__pickle_proto)
MemoryError: out of memory
From watching my memory usage I can see that the RAM fills up and then the swap starts to fill. Once both are full I get the error. From some reading I gather that each of these processes gets passed its own copy of the dictionary, which means updating it in parallel isn't an option.
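Just to illustrate what I think is happening, something like this (the sizes are made up) shows how big the pickled args get when the whole dict rides along with every job:

    import pickle
    import numpy as np

    # Made-up sizes, just to show the scale: submit() pickles its args,
    # so the whole dict of arrays is serialized again for every single job.
    data = {var: np.zeros((1000, 5000)) for var in ("U", "p", "T")}
    payload = pickle.dumps((data, 0, 0), protocol=2)
    print(len(payload))  # roughly 120 MB here, and that's per submitted job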
I will note that I DO NOT get this memory error when I run this in serial. So is there a good way to store this data, or pass it back to my dictionary, when running in parallel? The final data is stored in an HDF5 file, but HDF5 files don't seem to want to be opened and written to in parallel.
Is there a good solution? How do you handle large data in parallel?
Upvotes: 1
Views: 784
Reputation: 17853
Reading data in parallel isn't likely to gain you anything, since you're going to be I/O bound for as long as you're reading, whether it's one file at a time or all at once. I'd switch it around: read the data serially, but kick off the data setup (in parallel, natch) once each file is loaded. If you can load an entire file in one go and then process it from memory, you may see the performance gains you seek, at the expense of memory.
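For example, something along these lines keeps the reads serial while the per-file work runs in parallel. This is a minimal sketch using the standard multiprocessing module rather than pp, and process_block is a hypothetical stand-in for whatever parsing/setup you do per file:

    from multiprocessing import Pool

    def process_block(raw):
        # Hypothetical per-file work: parse the raw contents into numbers.
        return [float(x) for x in raw.split()]

    def read_then_process(paths):
        pool = Pool()
        jobs = []
        for path in paths:
            with open(path) as f:   # serial: the read stays I/O bound regardless
                raw = f.read()
            # parallel: the CPU-side setup overlaps with the next read
            jobs.append(pool.apply_async(process_block, (raw,)))
        pool.close()
        pool.join()
        return [job.get() for job in jobs]

The same shape works with pp: submit the processing job with the already-read contents and collect the results from the returned job objects, rather than handing each job the whole dict.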
If you're exhausting memory, you'll need to figure out how to write out some of the data as you go so that you can drop those dictionary entries.
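One way to do that, assuming h5py (the names and shapes below are hypothetical), is to create the datasets up front and write each timestep straight to the file instead of accumulating everything in the dict:

    import h5py

    def convert(out_path, variables, timeDirs, numPts, read_one):
        # read_one(var, i) is a hypothetical reader returning one timestep's values
        with h5py.File(out_path, "w") as f:
            dsets = {}
            for var in variables:
                dsets[var] = f.create_dataset(var, (len(timeDirs), numPts), dtype="f8")
            for i in range(len(timeDirs)):
                for var in variables:
                    # written to the file row by row, nothing kept in a big dict
                    dsets[var][i, :] = read_one(var, i)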
It might be feasible to memory-map the files instead of explicitly reading them; then parallelizing the processing may make more sense, depending on the speed of your data processing versus the speed of I/O. That lets you leverage the OS's scheduling, assuming the data processing you do while loading takes long enough.
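If the files are (or can be converted to) flat binary with a known dtype and length, which is an assumption on my part, numpy.memmap gives you that lazily paged access:

    import numpy as np

    def open_mapped(path, numPts):
        # The OS pages the data in as the array is touched, so the "read"
        # overlaps naturally with whatever processing uses the values.
        return np.memmap(path, dtype="f8", mode="r", shape=(numPts,))

    vals = open_mapped("0.01/U.bin", 5000)  # hypothetical path and size
    total = vals.sum()                      # paging happens only for what you touch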
Upvotes: 0