Vingtoft

Reputation: 14596

Fetching data with multiprocessing

I have a function posting to a server (AWS Lambda) to perform OCR on a base64-encoded image:

import json

import requests

# base_url is defined elsewhere in the application
def image_to_text(image64):
    url = base_url + 'text-to-image'
    data = json.dumps({'image64': image64})
    r = requests.post(url, data)
    r.raise_for_status()
    return r.json()['text'].encode('utf-8')

The function works: image_to_text('some long string') will return a proper response.

Problem: Calling image_to_text in parallel (multi-process) makes the application halt (without warnings or errors) at r = requests.post(url, data)

Example:

import multiprocessing as mp
from multiprocessing import cpu_count

p = mp.Pool(cpu_count())
results = p.map(image_to_text, ('A long string',
                                'Another long string'))
p.terminate()

Question: Why is my application halting and how can I use multiprocessing to fetch data with requests?

Upvotes: 1

Views: 540

Answers (1)

noxdafox

Reputation: 15030

Your application is probably not halting; it is slowing down significantly because of the large amount of data you are transferring through the Pool. What you perceive as a "halt" is actually heavy IPC (inter-process communication) overhead.

From the multiprocessing programming guidelines:

Avoid shared state

As far as possible one should try to avoid shifting large amounts of data between processes.

The Pool relies on an internal pipe to transfer the data to the workers that execute your image_to_text function. This pipe becomes a bottleneck when the amount of data to be delivered is large. In your case, you are sending the data both ways, doubling the number of bytes that need to be serialized and shipped.
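To make that concrete: the Pool pickles every argument before writing it to a worker's pipe, and pickles every return value on the way back. A quick illustrative check of the serialization cost (the payload size here is made up):

import pickle

image64 = 'A' * 10_000_000  # stand-in for a ~10 MB base64 image
# The Pool serializes each argument roughly like this before it
# ever reaches a worker process
payload = pickle.dumps(image64)
print(len(payload))  # ~10 MB written to the pipe per task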

I'd recommend dumping the data to temporary files and passing image_to_text only the file names. image_to_text can then open and read the data from the files on its own. You will notice your logic becoming significantly faster and more robust as well.
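A minimal sketch of that approach could look like the following; the helper names (dump_to_tempfile, image_to_text_from_file) and the placeholder base_url are assumptions for illustration, not code from the original post:

import json
import multiprocessing as mp
import os
import tempfile

import requests

base_url = 'https://example.invalid/'  # placeholder; use your real Lambda endpoint

def dump_to_tempfile(image64):
    # Persist the payload so only a short file name crosses the Pool's pipe
    with tempfile.NamedTemporaryFile('w', suffix='.b64', delete=False) as f:
        f.write(image64)
        return f.name

def image_to_text_from_file(path):
    # Read the large base64 data inside the worker process
    with open(path) as f:
        image64 = f.read()
    data = json.dumps({'image64': image64})
    r = requests.post(base_url + 'text-to-image', data)
    r.raise_for_status()
    return r.json()['text']

if __name__ == '__main__':
    paths = [dump_to_tempfile(s) for s in ('A long string', 'Another long string')]
    with mp.Pool(mp.cpu_count()) as pool:
        texts = pool.map(image_to_text_from_file, paths)
    for path in paths:  # clean up the temporary files
        os.remove(path)

This way only the short file paths are pickled for the trip to the workers; the large base64 strings are read from disk inside each worker, and only the much smaller OCR text travels back through the pipe.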

Upvotes: 1

Related Questions