How to asynchronously download millions of files from a file storage?

Question

Let's assume I have a database managing millions of documents, which are stored on a WebDAV or SMB server that does not support retrieving documents in bulk. Given a list of (potentially all) document IDs, how do I download the corresponding documents as fast as possible?

Iterating over the list and downloading the documents sequentially is far too slow. The two options I see are threads and async downloads.

My gut says that async programming should be preferred to threads, because I'm just waiting for IO on the client side. But I am rather new to async programming and I don't know how to do it. I assume that iterating over the whole list and starting an async download for every ID could send too many requests in a very short time, leading to rejected requests. So how do I throttle this? Is there a best-practice way to do it?

Theodor Zoulias · Accepted Answer

Take a look at this: How to limit the amount of concurrent async I/O? Using a SemaphoreSlim, as suggested in the accepted answer, is an easy and quite good solution.
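A minimal sketch of the SemaphoreSlim approach, to make the idea concrete. The class name ThrottledDownloader, the downloadAsync delegate, and the maxConcurrency parameter are placeholders of mine, not part of any library; the delegate stands in for whatever call actually fetches one document from your WebDAV or SMB server.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledDownloader
{
    // downloadAsync is a hypothetical delegate that fetches a single document.
    // maxConcurrency is the number of downloads allowed to be in flight at once.
    public static async Task DownloadAllAsync<TId>(
        IEnumerable<TId> ids,
        Func<TId, Task> downloadAsync,
        int maxConcurrency)
    {
        using var throttle = new SemaphoreSlim(maxConcurrency);
        var tasks = new List<Task>();

        foreach (var id in ids)
        {
            // Wait for a free slot before starting the next download,
            // so at most maxConcurrency requests are active at any time.
            await throttle.WaitAsync();
            tasks.Add(RunAsync(id));
        }

        // Wait for the downloads that are still running;
        // this also rethrows a failure from any of them.
        await Task.WhenAll(tasks);

        async Task RunAsync(TId id)
        {
            try
            {
                await downloadAsync(id);
            }
            finally
            {
                // Free the slot even if the download failed.
                throttle.Release();
            }
        }
    }
}
```

Note that this sketch keeps one Task object per document in the list until the end. That is usually tolerable, but with millions of IDs you may prefer to process the list in batches or to drop completed tasks as you go.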

My personal favorite though for this kind of job is the TPL Dataflow library. You can see here an example of using this library to download pages from the web asynchronously with a configurable level of concurrency, in combination with the HttpClient class. Here is another example.
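For illustration, here is roughly what the TPL Dataflow variant can look like. This is a sketch rather than the code from the linked examples: DataflowDownloader, downloadAsync, and maxConcurrency are names I made up, and the BoundedCapacity value is an arbitrary assumption. The library lives in the System.Threading.Tasks.Dataflow package.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class DataflowDownloader
{
    // downloadAsync is a hypothetical delegate that fetches a single document.
    public static async Task DownloadAllAsync<TId>(
        IEnumerable<TId> ids,
        Func<TId, Task> downloadAsync,
        int maxConcurrency)
    {
        var block = new ActionBlock<TId>(downloadAsync, new ExecutionDataflowBlockOptions
        {
            // Number of downloads the block runs concurrently.
            MaxDegreeOfParallelism = maxConcurrency,
            // Assumed buffer size: keeps the input queue small, so millions
            // of IDs are never queued inside the block at the same time.
            BoundedCapacity = maxConcurrency * 4
        });

        foreach (var id in ids)
        {
            // SendAsync waits while the block is full and returns false
            // if the block no longer accepts input (for example after a fault).
            if (!await block.SendAsync(id))
            {
                break;
            }
        }

        // Signal that no more IDs are coming, then wait for the remaining work.
        // Awaiting Completion rethrows an exception thrown by a download.
        block.Complete();
        await block.Completion;
    }
}
```

Compared with the semaphore approach, the bounded block gives you backpressure for free: the loop that feeds IDs is paused whenever the buffer is full, so memory use stays flat regardless of how long the list is.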
