Ricketts

Reputation: 5223

Performance when downloading thousands of images

I have a function that downloads thousands of images at a time from a 3rd party source. The number of images can range from 2,500 to 250,000 per run. As you can imagine, this process takes some time, and I am looking to optimize it the best I can.

The way it works is I take a list of image paths and loop through them, requesting each image from the 3rd party. Currently, before I make the request, I check whether the image already exists on the server: if it does, the loop skips that image; if it does not, it downloads it.

My question is whether the check before the download is slowing down the process (or possibly speeding it up). Would it be more efficient to download every file and let it overwrite already existing images, thus cutting out the step of checking for existence?

If anyone else has any tips for downloading this volume of images, they are welcome!

Upvotes: 0

Views: 711

Answers (2)

sammy_winter

Reputation: 139

The real answer depends on three things:
1. How often you come across an image that already exists. The less often you have a hit, the less useful checking is.
2. The latency of the destination storage. Is the destination storage location local or far away? If it is in India with a 300ms latency (and probably high packet loss), the check becomes more expensive relative to the download. This is mitigated significantly by smart threading.
3. Your bandwidth / throughput from your source to your destination. The higher your bandwidth, the less downloading a file twice costs you.

If you have a less than 1% hit rate for images that already exist, you're not getting much of a gain from the check (at most ~1%), but if 90% of the images already exist, it would probably be worth checking even if the destination file store is remote / far away. Either way it is a balancing act, but if you have a hit rate high enough to ask, it's likely that checking to see if you already have the file would be useful.

If images you already have don't get deleted, the best way to do this would probably be to keep a database of images that you've downloaded, and check your list of files to download against that database.
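A minimal sketch of that idea, assuming C# (here an in-memory HashSet stands in for the database, and the file names are made up):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical: the set of image paths you have already downloaded,
// loaded once from your database at the start of the run.
var alreadyDownloaded = new HashSet<string> { "cat.jpg", "dog.jpg" };

// Filter the incoming list against it -- one in-memory lookup per path,
// no round trip to the destination storage.
string[] ToFetch(IEnumerable<string> requested) =>
    requested.Where(p => !alreadyDownloaded.Contains(p)).ToArray();
```

The point is that the existence check becomes a hash lookup instead of a network or disk operation, so its cost is effectively zero regardless of hit rate.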

If that isn't feasible because images get deleted / renamed or something, minimize the impact of the check by threading it. The performance difference between foreach and Parallel.ForEach for high-latency operations is huge.
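To illustrate, a minimal C# sketch of the threaded version (ImageExists and DownloadImage are hypothetical stand-ins for your real existence check and HTTP request):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-ins for the real existence check and download.
bool ImageExists(string path) => false;
void DownloadImage(string path) { /* issue the HTTP request here */ }

int DownloadAll(string[] paths)
{
    int downloaded = 0;
    // Each iteration mostly waits on the network, so running them
    // concurrently hides the latency; cap the parallelism so you
    // don't flood the remote server.
    Parallel.ForEach(
        paths,
        new ParallelOptions { MaxDegreeOfParallelism = 16 },
        path =>
        {
            if (!ImageExists(path))
            {
                DownloadImage(path);
                Interlocked.Increment(ref downloaded);
            }
        });
    return downloaded;
}
```

MaxDegreeOfParallelism is a tuning knob: too low and you stop hiding the latency, too high and you risk overwhelming the source or your own connection.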

Finally, 250k images can be a lot of data if they're large images. It might be faster to send physical media (i.e. put the data on a hard drive and send the drive).

Upvotes: 3

Jens Meinecke

Reputation: 2940

Doing a

  System.IO.File.Exists(pathName);

is a lot less expensive than doing a download, so checking first speeds up the process whenever the image already exists by skipping that download.

Upvotes: 1
