dkalev

Reputation: 639

When downloading files in multiple threads in Python, threads finish before all files are available

When downloading images in multiple threads, some of the images are not available after all threads have finished. This is the simplified code for the image download and how the threads are started and joined. I am expecting all the file handles to be released and the files to be available after the thread queue is empty.

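import os
import requests

def download_image(id, link, path):
    response = requests.get(link)
    # Write the file only when the request succeeded; otherwise filename would be undefined.
    if response.status_code == 200:
        filename = os.path.join(path, f"{id}.jpg")
        with open(filename, 'wb') as file:
            file.write(response.content)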



queue = [threading.Thread(target=download_image, args=(id, link, image_dir)) for id, link in photos_to_download]

while queue:
    threads = queue[:thread_amount] # pick 'thread_amount' threads from the queue 
    threads = [t.start() for t in threads] # start the threads
    threads = [t.join() for t in threads if t != None] # wait for the threads to finish execution
    queue = queue[thread_amount:] # remove finished tasks from queue

However, after the loop above has finished, not all images are available. As a workaround, the code below explicitly waits until no new files are being added to the image directory.

n_images = len(os.listdir(image_dir))
time.sleep(5)
while len(os.listdir(image_dir)) > n_images: 
    n_images = len(os.listdir(image_dir))
    time.sleep(5)

Is this a problem with threading, with the os module, or is the OS (Windows) unable to register the new files immediately?

Upvotes: 0

Views: 325

Answers (1)

dkalev

Reputation: 639

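The problem was that Thread.start() returns None, so the first list comprehension replaced the list of threads with a list of None values. The "if t != None" filter in the second line then skipped every entry, so join() was never called on any of the actual threads and the loop moved on while the downloads were still running: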

threads = [t.start() for t in threads]
threads = [t.join() for t in threads if t != None]
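
A minimal sketch of one way to correct the loop, keeping the batch approach from the question: start the threads in a plain for loop so the list still holds the Thread objects, then join each of them. The names run_in_batches, work, and workers are hypothetical and only serve the illustration; work stands in for download_image so the example runs without network access.

```python
import threading

def run_in_batches(threads, batch_size):
    """Start and join the given threads, batch_size at a time."""
    pending = list(threads)
    while pending:
        batch = pending[:batch_size]
        for t in batch:
            t.start()  # start() returns None, so do not store its result
        for t in batch:
            t.join()   # block until this thread has finished
        pending = pending[batch_size:]

# Stand-in workload (hypothetical): each thread records its own index.
results = []
lock = threading.Lock()

def work(i):
    with lock:
        results.append(i)

workers = [threading.Thread(target=work, args=(i,)) for i in range(10)]
run_in_batches(workers, batch_size=3)
print(sorted(results))
```

Because every thread is joined before the next batch starts, all work is complete when run_in_batches returns, and no polling of the directory should be needed. The standard library's concurrent.futures.ThreadPoolExecutor offers similar bounded concurrency without managing the batches by hand.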

Upvotes: 0
