Reputation: 832
I am writing a function to convert PDFs to PNG images; it looks like this:
import os
from wand.image import Image

def convert_pdf(filename, resolution):
    with Image(filename=filename, resolution=resolution) as img:
        pages_dir = os.path.join(os.path.dirname(filename), 'pages')
        page_filename = os.path.splitext(os.path.basename(filename))[0] + '.png'
        os.makedirs(pages_dir)
        img.save(filename=os.path.join(pages_dir, page_filename))
When I try to parallelize it, memory usage keeps growing and I cannot finish processing my PDF files:
import glob
from joblib import Parallel, delayed

def convert(dataset, resolution):
    Parallel(n_jobs=-1, max_nbytes=None)(
        delayed(convert_pdf)(filename, resolution) for filename in glob.iglob(dataset + '/**/*.pdf', recursive=True)
    )
When I call the function serially, memory usage stays constant.
How does joblib manage memory allocation for each parallel instance?
How can I modify my code so that memory usage stays constant when running in parallel?
Upvotes: 3
Views: 3048
Reputation: 33532
Joblib uses serialization techniques to pass the data to all of your workers, so of course total memory usage grows with the number of workers.
From the docs:
By default the workers of the pool are real Python processes forked using the multiprocessing module of the Python standard library when n_jobs != 1. The arguments passed as input to the Parallel call are serialized and reallocated in the memory of each worker process.
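You can see that copying at work with a minimal sketch (the `inspect` function and `payload` below are made up for illustration): each call runs in a separate worker process, and every worker receives its own deserialized copy of the arguments.

import os
from joblib import Parallel, delayed

def inspect(data):
    # runs inside a worker process; `data` is this worker's own copy
    return os.getpid(), len(data)

payload = ["stand-in for a large argument"] * 1000
# with n_jobs=2, the payload is serialized and shipped to two separate processes
print(Parallel(n_jobs=2)(delayed(inspect)(payload) for _ in range(4)))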
There is no way to process two files in parallel using only the memory needed for one (not if you actually want a speedup)!
The docs also mention memory maps, which are mostly used for numerical data and for workers that share data (the OS is then responsible for caching). They won't help here because there is no shared data in your case. Since memory maps are automatically kept memory-friendly with respect to caching, memory-related crashes should not happen with them, but the extra disk IO (as opposed to in-memory caching) costs performance.
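For completeness, a sketch of the case where memory-mapping does apply, i.e. workers reading from one large numpy array (`column_mean` and `big` are made up for illustration); arrays above the `max_nbytes` threshold are dumped to disk and memmapped instead of being copied into each worker:

import numpy as np
from joblib import Parallel, delayed

def column_mean(shared, i):
    # `shared` arrives as a read-only memory map, not a per-worker copy
    return shared[:, i].mean()

big = np.random.rand(10_000, 100)  # hypothetical large array
# max_nbytes sets the size above which arrays are memmapped to a temp folder
means = Parallel(n_jobs=2, max_nbytes='1M')(
    delayed(column_mean)(big, i) for i in range(big.shape[1])
)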
So in short: limit the number of workers to what your memory allows, n_jobs=4 for example, instead of n_jobs=-1.
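Applied to your code, the only change is a fixed, modest worker count in place of n_jobs=-1 (4 is just an example, tune it to your RAM; max_nbytes is irrelevant here since no large arrays are passed):

import glob
from joblib import Parallel, delayed

def convert(dataset, resolution, n_jobs=4):
    # a bounded worker count keeps peak memory at roughly n_jobs times one conversion
    Parallel(n_jobs=n_jobs)(
        delayed(convert_pdf)(filename, resolution)
        for filename in glob.iglob(dataset + '/**/*.pdf', recursive=True)
    )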
Upvotes: 3