User19

Reputation: 121

correct usage of multiprocessing for image download

I wrote the following function and tested it in the Python shell, and the images were downloaded successfully; however, when I ran it in a script, no images were downloaded.

import os
import requests
from time import time
import uuid
from multiprocessing.pool import ThreadPool
main_file_name = 'test1.csv'

my_set = set()
with open(main_file_name, 'r') as f:  # read image URLs
    for row in f:
        my_set.add(row.split(',')[2].strip())

def get_url(entry):
    path = str(uuid.uuid4()) + ".jpg"
    if not os.path.exists(path):
        r = requests.get(entry, stream=True)
        if r.status_code == 200:
            with open(path, 'wb') as f:
                for chunk in r:
                    f.write(chunk)

start = time()
results = ThreadPool(8).imap_unordered(get_url, my_set)
print(f"Elapsed Time: {time() - start}")

I double-checked and it works in the shell. Is there anything I am missing in the script?

Upvotes: 0

Views: 508

Answers (1)

wishmaster

Reputation: 1487

"results" is of class multiprocessing.pool.IMapUnorderedIterator, a good way to make sure the URLs are downloaded is to actually loop on results

start = time()
results = ThreadPool(8).imap_unordered(get_url, my_set)
for _ in results:
    pass
print(f"Elapsed Time: {time() - start}")

Another method that will also do the trick is to keep the main thread alive until the downloads finish, for example with time.sleep:

from time import sleep
start = time()
results = ThreadPool(8).imap_unordered(get_url, my_set)
sleep(10)  # make sure this is long enough for the downloads to finish
print(f"Elapsed Time: {time() - start}")

The reason your script doesn't work is that the script exits immediately after results is created; the pool's worker threads are daemon threads, so they are killed as soon as the main thread finishes. The reason python3 -i test.py (or simply pasting your code into the shell) works is that the main thread stays alive, so the images have time to be downloaded.
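For completeness, a more deterministic fix than guessing a sleep duration is to block until the pool has drained, e.g. with map() followed by close() and join(). A minimal sketch, reusing get_url and my_set from the question:

from multiprocessing.pool import ThreadPool
from time import time

start = time()
pool = ThreadPool(8)
pool.map(get_url, my_set)  # map() blocks until every task has finished
pool.close()               # no further tasks will be submitted
pool.join()                # wait for the worker threads to exit
print(f"Elapsed Time: {time() - start}")

This way the script cannot exit before the downloads complete, no matter how long they take.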

Upvotes: 1
