Reputation: 639
When downloading images in multiple threads, some of the images are not available after all threads have finished. Below is the simplified code for the image download and for how the threads are started and joined. I expect all file handles to be released and all files to be available once the thread queue is empty.
def download_image(id, link, path):
    response = requests.get(link)
    if response.status_code == 200:
        filename = os.path.join(path, f"{id}.jpg")
        with open(filename, 'wb') as file:
            file.write(response.content)
queue = [threading.Thread(target=download_image, args=(id, link, image_dir)) for id, link in photos_to_download]
while queue:
    threads = queue[:thread_amount]  # pick 'thread_amount' threads from the queue
    threads = [t.start() for t in threads]  # start the threads
    threads = [t.join() for t in threads if t != None]  # wait for the threads to finish execution
    queue = queue[thread_amount:]  # remove finished tasks from queue
However, after the threads have finished, not all images are available, so with the code below I explicitly wait until no new files are being added to the image directory.
n_images = len(os.listdir(image_dir))
time.sleep(5)
while len(os.listdir(image_dir)) > n_images:
    n_images = len(os.listdir(image_dir))
    time.sleep(5)
Is this a problem with threading, with the os library, or is the OS (Windows) unable to register the new files immediately?
Upvotes: 0
Views: 325
Reputation: 639
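The problem was that Thread.start() returns None. In these lines the first comprehension replaced the list of threads with a list of None values, and the "if t != None" filter in the second comprehension then skipped every entry, so join() was never called on the actual threads and the loop moved on without waiting: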
threads = [t.start() for t in threads]
threads = [t.join() for t in threads if t != None]
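A minimal sketch of the fix, assuming the same batching approach as in the question: keep the Thread objects in the list and call start() and join() on them in plain loops instead of storing the return values. The names run_in_batches and write_file are illustrative and not part of the original code; write_file is a hypothetical stand-in for download_image so that the example runs without network access.

```python
import os
import tempfile
import threading


def run_in_batches(threads, batch_size):
    """Start and join the given threads in groups of at most batch_size."""
    for i in range(0, len(threads), batch_size):
        batch = threads[i:i + batch_size]
        for t in batch:
            t.start()  # start() returns None, so its result is not stored
        for t in batch:
            t.join()   # block until every thread in this batch has finished


def write_file(file_id, path):
    # Hypothetical stand-in for download_image: writes a small file locally.
    with open(os.path.join(path, f"{file_id}.jpg"), "wb") as file:
        file.write(b"placeholder")


image_dir = tempfile.mkdtemp()
queue = [threading.Thread(target=write_file, args=(i, image_dir)) for i in range(10)]
run_in_batches(queue, batch_size=3)

# After run_in_batches returns, every thread has been joined.
n_files = len(os.listdir(image_dir))
all_finished = not any(t.is_alive() for t in queue)
```

With the threads actually joined, all files are present as soon as the loop ends, so the polling loop with time.sleep(5) should no longer be needed.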
Upvotes: 0