User19

Reputation: 121

correct usage of multiprocessing for image download

I wrote the following function and tested it in the Python shell, and the images were downloaded successfully; however, when I ran it in a script, no images were downloaded.

import os
import requests
from time import time
import uuid
from multiprocessing.pool import ThreadPool
main_file_name = 'test1.csv'

my_set = set()
with open(main_file_name, 'r') as f:  # read image URLs
    for row in f:
        my_set.add(row.split(',')[2].strip())

def get_url(entry):
    path = str(uuid.uuid4()) + ".jpg"
    if not os.path.exists(path):
        r = requests.get(entry, stream=True)
        if r.status_code == 200:
            with open(path, 'wb') as f:
                for chunk in r:
                    f.write(chunk)

start = time()
results = ThreadPool(8).imap_unordered(get_url, my_set)
print(f"Elapsed Time: {time() - start}")

I double-checked and it works in the shell. Is there anything I am missing in the script?

Upvotes: 0

Views: 508

Answers (1)

wishmaster

Reputation: 1487

"results" is of class multiprocessing.pool.IMapUnorderedIterator, a good way to make sure the URLs are downloaded is to actually loop on results

start = time()
results = ThreadPool(8).imap_unordered(get_url, my_set)
for _ in results:
    pass
print(f"Elapsed Time: {time() - start}")

Another method that will also do the trick is to keep the main thread alive until the downloads finish, for example with time.sleep:

from time import sleep
start = time()
results = ThreadPool(8).imap_unordered(get_url, my_set)
sleep(10)  # make sure this is long enough for the downloads to finish
print(f"Elapsed Time: {time() - start}")

The reason your script doesn't work is that the script exits immediately after results is created; the pool's worker threads are daemon threads, so they are killed as soon as the main thread finishes. The reason python3 -i test.py (or simply pasting your code into the shell) works is that the main thread stays alive, so the images have time to be downloaded.
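For completeness, a more deterministic fix than guessing a sleep duration is to block until the pool has drained, e.g. with map() followed by close() and join(). A minimal sketch, reusing get_url and my_set from the question:

from multiprocessing.pool import ThreadPool
from time import time

start = time()
pool = ThreadPool(8)
pool.map(get_url, my_set)  # map() blocks until every task has finished
pool.close()               # no further tasks will be submitted
pool.join()                # wait for the worker threads to exit
print(f"Elapsed Time: {time() - start}")

This way the script cannot exit before the downloads complete, no matter how long they take.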

Upvotes: 1
