Thibaut Mattio

Reputation: 832

How does joblib's Parallel function manage memory?

I am writing a function to convert PDFs to PNG images; it looks like this:

import os
from wand.image import Image

def convert_pdf(filename, resolution):
    with Image(filename=filename, resolution=resolution) as img:
        pages_dir = os.path.join(os.path.dirname(filename), 'pages')
        page_filename = os.path.splitext(os.path.basename(filename))[0] + '.png'
        os.makedirs(pages_dir, exist_ok=True)
        img.save(filename=os.path.join(pages_dir, page_filename))

When I try to parallelize it, memory usage keeps growing and I cannot finish processing my PDF files:

import glob

from joblib import Parallel, delayed

def convert(dataset, resolution):
    Parallel(n_jobs=-1, max_nbytes=None)(
        delayed(convert_pdf)(filename, resolution)
        for filename in glob.iglob(dataset + '/**/*.pdf', recursive=True)
    )

When I call the function serially, the memory stays constant.

How does joblib manage the memory allocation for each parallel instance?

How can I modify my code so that the memory stays constant when running in parallel?

Upvotes: 3

Views: 3048

Answers (1)

sascha

Reputation: 33532

Joblib serializes the input arguments and copies them to every worker process, so memory usage naturally grows with the number of workers.

From the docs:

By default the workers of the pool are real Python processes forked using the multiprocessing module of the Python standard library when n_jobs != 1. The arguments passed as input to the Parallel call are serialized and reallocated in the memory of each worker process.
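A minimal sketch of that behavior (assuming joblib is installed; the function name worker_pid is made up for illustration) — each task reports the PID of the process that ran it, showing that work is farmed out to separate worker processes, each with its own copy of the arguments:

```python
import os
from joblib import Parallel, delayed

def worker_pid(i):
    # Each task returns the PID of the worker process that ran it.
    return os.getpid()

# With n_jobs=2, tasks are distributed across two separate processes,
# so the argument `i` is pickled and copied into each worker's memory.
pids = Parallel(n_jobs=2)(delayed(worker_pid)(i) for i in range(8))
```

Printing `sorted(set(pids))` typically shows distinct PIDs, none of which is the parent's.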

There is no way to process two files in parallel using only the memory of one (if you really want a speedup)!

The docs also mention memory maps, which are typically used for numerical data when workers share it (the OS is then responsible for caching). They won't help here, because there is no shared data in your case. But since memory maps stay cache-friendly by design, memory-related crashes should not happen with them; the extra IO involved (as opposed to caching) will, of course, cost performance.
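For completeness, here is a sketch of how that memory-mapping kicks in for numerical data (assuming joblib and numpy are installed; col_sum is a made-up example function). Lowering max_nbytes below the array's size makes joblib dump the array to a memory-mapped file that all workers read, instead of pickling a copy per worker:

```python
import numpy as np
from joblib import Parallel, delayed

def col_sum(arr, i):
    # Workers read the (possibly memory-mapped) array column by column.
    return float(arr[:, i].sum())

data = np.ones((1000, 4))  # ~32 KB, above the 1 KB threshold below

# max_nbytes='1K' forces automatic memmapping of `data` (default is '1M').
sums = Parallel(n_jobs=2, max_nbytes='1K')(
    delayed(col_sum)(data, i) for i in range(4)
)
```

Again: this only pays off when workers share one large input, which is not the case for your per-file PDF conversion.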

So in short:

  • using X cores, expect roughly X times the memory usage
    • there is nothing you can do about that
  • if you observe much more memory consumption than this expected linear growth, something else is wrong
  • I'm not sure how many cores you have, but you can try limiting them with, for example, n_jobs=4
  • this kind of IO-heavy processing is not a natural candidate for parallel processing
    • IO dominates the computation!
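The n_jobs=4 suggestion above can be sketched like this (process_file is a hypothetical stand-in for your memory-heavy convert_pdf, since wand isn't needed to show the idea) — capping the worker count bounds peak memory at roughly four tasks' worth:

```python
from joblib import Parallel, delayed

def process_file(name):
    # Stand-in for a memory-heavy per-file task such as convert_pdf.
    buf = bytearray(10_000)
    return f"{name}: {len(buf)} bytes"

names = [f"doc{i}.pdf" for i in range(8)]

# n_jobs=4 limits concurrent workers, so at most ~4 tasks' worth of
# memory is allocated at any time, instead of one worker per core.
results = Parallel(n_jobs=4)(delayed(process_file)(n) for n in names)
```

Total runtime may suffer slightly, but peak memory no longer scales with the full core count.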

Upvotes: 3
