Reputation: 832
I am writing a function to convert PDFs to PNG images; it looks like this:
import os
from wand.image import Image

def convert_pdf(filename, resolution):
    with Image(filename=filename, resolution=resolution) as img:
        pages_dir = os.path.join(os.path.dirname(filename), 'pages')
        page_filename = os.path.splitext(os.path.basename(filename))[0] + '.png'
        os.makedirs(pages_dir)
        img.save(filename=os.path.join(pages_dir, page_filename))
When I try to parallelize it, memory usage keeps growing and I cannot finish processing my PDF files:
import glob
from joblib import Parallel, delayed

def convert(dataset, resolution):
    Parallel(n_jobs=-1, max_nbytes=None)(
        delayed(convert_pdf)(filename, resolution) for filename in glob.iglob(dataset + '/**/*.pdf', recursive=True)
    )
When I call the function serially, memory usage stays constant.
How does joblib manage memory allocation for each parallel instance?
How can I modify my code so that memory usage stays constant when running in parallel?
Upvotes: 3
Views: 3048
Reputation: 33532
Joblib uses serialization techniques to pass the data to all of your workers, so of course total memory usage grows with the number of workers.
From the docs:
By default the workers of the pool are real Python processes forked using the multiprocessing module of the Python standard library when n_jobs != 1. The arguments passed as input to the Parallel call are serialized and reallocated in the memory of each worker process.
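You can see that copying at work with a minimal sketch (the `inspect` function and `payload` below are made up for illustration): each call runs in a separate worker process, and every worker receives its own deserialized copy of the arguments.

import os
from joblib import Parallel, delayed

def inspect(data):
    # runs inside a worker process; `data` is this worker's own copy
    return os.getpid(), len(data)

payload = ["stand-in for a large argument"] * 1000
# with n_jobs=2, the payload is serialized and shipped to two separate processes
print(Parallel(n_jobs=2)(delayed(inspect)(payload) for _ in range(4)))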
There is no way to process two files in parallel using only the memory needed for one (not if you actually want a speedup)!
The docs also mention memory maps, which are mostly used for numerical data and for workers that share data (the OS is then responsible for caching). They won't help here because there is no shared data in your case. Since memory maps are automatically kept memory-friendly with respect to caching, memory-related crashes should not happen with them, but the extra disk IO (as opposed to in-memory caching) costs performance.
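For completeness, a sketch of the case where memory-mapping does apply, i.e. workers reading from one large numpy array (`column_mean` and `big` are made up for illustration); arrays above the `max_nbytes` threshold are dumped to disk and memmapped instead of being copied into each worker:

import numpy as np
from joblib import Parallel, delayed

def column_mean(shared, i):
    # `shared` arrives as a read-only memory map, not a per-worker copy
    return shared[:, i].mean()

big = np.random.rand(10_000, 100)  # hypothetical large array
# max_nbytes sets the size above which arrays are memmapped to a temp folder
means = Parallel(n_jobs=2, max_nbytes='1M')(
    delayed(column_mean)(big, i) for i in range(big.shape[1])
)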
So in short: limit the number of workers to what your memory allows, n_jobs=4 for example, instead of n_jobs=-1.
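Applied to your code, the only change is a fixed, modest worker count in place of n_jobs=-1 (4 is just an example, tune it to your RAM; max_nbytes is irrelevant here since no large arrays are passed):

import glob
from joblib import Parallel, delayed

def convert(dataset, resolution, n_jobs=4):
    # a bounded worker count keeps peak memory at roughly n_jobs times one conversion
    Parallel(n_jobs=n_jobs)(
        delayed(convert_pdf)(filename, resolution)
        for filename in glob.iglob(dataset + '/**/*.pdf', recursive=True)
    )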
Upvotes: 3