Reputation: 544
So I'm reading in a lot of data from a bunch of different files, and reading it in is one of the major speed hurdles. The problem is that the files live in per-timestep directories, each of which contains a file for every variable at that timestep. So basically I have some functions that look like this:
    import numpy as np
    import pp

    job_server = pp.Server()

    def ReadFiles(path, numPts, timeDirs, variables):
        # one (timesteps x points) array per variable
        data = {}
        for var in variables:
            data[var] = np.zeros((len(timeDirs), numPts))
        for i in range(len(timeDirs)):
            tfile = str(path) + "/" + str(timeDirs[i])
            for j in range(len(variables)):
                # args stands in for the per-file details (tfile, variables[j], ...)
                job_server.submit(ReadData, (args, data, i, j), modules=("numpy",))

    def ReadData(args, data, i, j):
        # update the path for the particular variable
        # read in the data from the file
        # data[variables[j]][i] = that data
TL;DR: I initialize a NumPy array for each variable inside a dict, then update the right slice of it inside another function that is parallelized.
I am using Parallel Python (pp). I would like to move this code to a cluster at some point. The error I get is:
File "/home/steven/anaconda2/lib/python2.7/site-packages/pp.py", line 460, in submit
sargs = pickle.dumps(args, self.__pickle_proto)
MemoryError: out of memory
From watching my memory usage I can see that the RAM fills up and then the swap starts to fill. Once both are full I get the error. From some reading I gather that each of these processes gets passed its own copy of the dictionary, which means updating it in parallel isn't an option.
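Just to illustrate what I think is happening, something like this (the sizes are made up) shows how big the pickled args get when the whole dict rides along with every job:

    import pickle
    import numpy as np

    # Made-up sizes, just to show the scale: submit() pickles its args,
    # so the whole dict of arrays is serialized again for every single job.
    data = {var: np.zeros((1000, 5000)) for var in ("U", "p", "T")}
    payload = pickle.dumps((data, 0, 0), protocol=2)
    print(len(payload))  # roughly 120 MB here, and that's per submitted job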
I will note that I DO NOT get this memory error when I run this in serial. So is there a good way to store this data, or pass it back to my dictionary, when running in parallel? The final data is stored in an HDF5 file, but HDF5 files don't seem to want to be opened and written to in parallel.
Is there a good solution? How do you handle large data in parallel?
Upvotes: 1
Views: 784
Reputation: 17853
Reading data in parallel isn't likely to gain you anything, since you're going to be I/O bound for as long as you're reading, whether it's one file at a time or all at once. I'd switch it around: read the data serially, but kick off the data setup (in parallel, natch) once each file is loaded. If you can load an entire file in one go and then process it from memory, you may see the performance gains you seek, at the expense of memory.
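For example, something along these lines keeps the reads serial while the per-file work runs in parallel. This is a minimal sketch using the standard multiprocessing module rather than pp, and process_block is a hypothetical stand-in for whatever parsing/setup you do per file:

    from multiprocessing import Pool

    def process_block(raw):
        # Hypothetical per-file work: parse the raw contents into numbers.
        return [float(x) for x in raw.split()]

    def read_then_process(paths):
        pool = Pool()
        jobs = []
        for path in paths:
            with open(path) as f:   # serial: the read stays I/O bound regardless
                raw = f.read()
            # parallel: the CPU-side setup overlaps with the next read
            jobs.append(pool.apply_async(process_block, (raw,)))
        pool.close()
        pool.join()
        return [job.get() for job in jobs]

The same shape works with pp: submit the processing job with the already-read contents and collect the results from the returned job objects, rather than handing each job the whole dict.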
If you're exhausting memory, you'll need to figure out how to write out some of the data as you go so that you can drop those dictionary entries.
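One way to do that, assuming h5py (the names and shapes below are hypothetical), is to create the datasets up front and write each timestep straight to the file instead of accumulating everything in the dict:

    import h5py

    def convert(out_path, variables, timeDirs, numPts, read_one):
        # read_one(var, i) is a hypothetical reader returning one timestep's values
        with h5py.File(out_path, "w") as f:
            dsets = {}
            for var in variables:
                dsets[var] = f.create_dataset(var, (len(timeDirs), numPts), dtype="f8")
            for i in range(len(timeDirs)):
                for var in variables:
                    # written to the file row by row, nothing kept in a big dict
                    dsets[var][i, :] = read_one(var, i)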
It might be feasible to memory-map the files instead of explicitly reading them; then parallelizing the processing may make more sense, depending on the speed of your data processing versus the speed of I/O. That lets you leverage the OS's scheduling, assuming the data processing you do while loading takes long enough.
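If the files are (or can be converted to) flat binary with a known dtype and length, which is an assumption on my part, numpy.memmap gives you that lazily paged access:

    import numpy as np

    def open_mapped(path, numPts):
        # The OS pages the data in as the array is touched, so the "read"
        # overlaps naturally with whatever processing uses the values.
        return np.memmap(path, dtype="f8", mode="r", shape=(numPts,))

    vals = open_mapped("0.01/U.bin", 5000)  # hypothetical path and size
    total = vals.sum()                      # paging happens only for what you touch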
Upvotes: 0