drhanlau

Reputation: 2527

Download multiple HTML pages at once in C#

I have a list of web addresses (> 10k) that I need to download. Downloading them with a single-threaded application is very time-consuming. Which is the better option, multithreading or multiple BackgroundWorker instances, and why?

Upvotes: 4

Views: 2208

Answers (7)

Jim Mischel

Reputation: 133950

The approach you should use depends in large part on just how quickly you want to download those 10,000 pages and how often you want to do it.

In general, you can expect a single-threaded download application to average about one page per second. Your results will vary depending on the sites you're downloading from. Getting stuff from yahoo.com is going to be faster than downloading from a server that somebody's hosting on a cable modem. The nice thing about a single-threaded download application is that it's very easy to write. If you only need to download those pages once, write the single-threaded app, put it to work, and take a long lunch. You'll have your data in about three hours.

If you have a quad-core machine, you can do about four pages per second. Just write your single-threaded application, split your URLs list into four equal pieces, start four instances of your application, and take a regular lunch. You'll have the data when you get back.

If you'll be downloading those pages on a regular basis, then you can write your program to maintain a BlockingCollection for your URLs. Spin up four threads, each of which does essentially this:

// each of the four threads drains the shared BlockingCollection<string> ('urls');
// the loop ends once CompleteAdding() has been called and the collection is empty
using (var wc = new WebClient())
{
    foreach (string url in urls.GetConsumingEnumerable())
    {
        string page = wc.DownloadString(url);   // save 'page' however you need
    }
}

That will execute in the same amount of time as having four separate instances of the single-threaded downloader. Actually, it will probably execute slightly faster because you're not splitting the queue, so you don't have the problem of one thread finishing its share and stopping while there are still URLs left to download. Again, the program is incredibly easy to write and you'll have those 10,000 pages in under an hour.

You can go much faster than that. At typical cable modem speeds, you can achieve close to 20 pages per second without too much trouble. Forget using the TPL or ThreadPool.QueueUserWorkItem, etc. Instead, use WebClient and DownloadDataAsync. Create a queue of, say, 10 WebClient instances. Then, your main thread does this:

// main thread; 'clients' is a BlockingCollection<WebClient> pre-loaded with ~10
// instances, each with its DownloadDataCompleted handler already attached
foreach (string url in urls.GetConsumingEnumerable())
{
    WebClient client = clients.Take();            // blocks if all clients are currently busy
    client.DownloadDataAsync(new Uri(url), url);  // the url rides along as user state
}

Each WebClient instance's DownloadDataCompleted event handler is called when its download completes, so you can save the data there. The handler also puts the WebClient instance back into the queue so that it can be re-used.
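A hedged sketch of such a handler (the 'clients' collection and the SaveData helper are placeholders of mine, not from the original answer):

// attach once per client: client.DownloadDataCompleted += OnDownloadDataCompleted;
void OnDownloadDataCompleted(object sender, DownloadDataCompletedEventArgs e)
{
    string url = (string)e.UserState;       // the user token passed to DownloadDataAsync
    if (e.Error == null)
        SaveData(url, e.Result);            // placeholder: persist the downloaded bytes
    else
        Console.WriteLine("{0}: {1}", url, e.Error.Message);

    clients.Add((WebClient)sender);         // return the client to the queue for re-use
}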

Again, this is a fairly simple approach, but it's very effective. It takes advantage of the asynchronous capabilities of HttpWebRequest (which is what WebClient uses to do its thing). With this approach you don't end up with 10 or more threads executing all the time. Instead, the thread pool spins up and uses only as many threads as required to read the data and execute your callback. If you use TPL or some other explicit multithreading technique, you end up with a bunch of threads that spend most of their time doing nothing while waiting for connections, etc.

You'll have to play with the number of concurrent downloads (i.e. the number of WebClient instances you have in your queue). How many you can support depends mostly on the speed of your Internet connection. It will also depend on the average latency of DNS requests, which can be surprisingly long, and on how many different domains you're downloading from.

One other caution when using a multithreaded approach is that of politeness. If all 10,000 of those URLs are from the same domain, you do not want to be hitting it with 10 simultaneous requests. They'll likely think you're trying to perpetrate a DOS attack, and block you. If those URLs are from just a handful of domains, you'll need to throttle your connections. If you only have a handful of URLs from any one particular domain, then this isn't a problem.
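One hedged way to implement that kind of per-domain throttling (the two-requests-per-host limit and every name below are illustrative, not from the answer):

using System;
using System.Collections.Concurrent;
using System.Threading;

// one semaphore per host, capped at 2 concurrent requests (tune to taste)
static readonly ConcurrentDictionary<string, SemaphoreSlim> PerHostThrottle =
    new ConcurrentDictionary<string, SemaphoreSlim>();

static void DownloadPolitely(string url, Action<string> download)
{
    var gate = PerHostThrottle.GetOrAdd(new Uri(url).Host,
                                        _ => new SemaphoreSlim(2, 2));
    gate.Wait();                 // block until this host has a free slot
    try { download(url); }
    finally { gate.Release(); }
}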

Upvotes: 4

Daniel Mošmondor

Reputation: 19956

Of course you'll use multiple threads to download the content, queue it, and so on. However, you won't be able to download as fast as possible without some EXPERIMENTING, because there are plenty of factors that will affect your decision on how many concurrent threads to run. To name a few:

  • size of each html file
  • your download and upload bandwidth
  • server speed
  • latency of your connection (ping time) and number of router hops in between
  • and so on...

The fact is, TCP, which sits underneath the HTTP protocol, has a life of its own, and sometimes you'll be best off downloading one file at a time, and sometimes it will be best to download 1000 at a time.

BTW, if your server supports HTTP keep-alive, which is an absolutely common thing today, it may be best to download the files sequentially. You will have one open connection to the server: send a request, get the content, send another request, get the content, and so on. If some of the factors above line up, using multiple threads won't get you as much as a 10% improvement over this simple method.
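For what it's worth, the sequential keep-alive loop can be as simple as this (the file naming is illustrative; WebClient re-uses the underlying connection to the same host between requests):

using System.IO;
using System.Net;

// one WebClient, one connection, requests issued back to back
int i = 0;
using (var wc = new WebClient())
{
    foreach (string url in urls)                 // 'urls' is your list of addresses
        File.WriteAllBytes("page" + (i++) + ".html", wc.DownloadData(url));
}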

Keep in mind that if the server isn't serving static pages, but pages generated from a database, multiple threads will also put load on the server that can actually slow the download down.

And so on...

Upvotes: 0

Gayot Fow

Reputation: 8792

I have had the exact same problem, and solved it with a throttling strategy and the .NET 4.0 Task class. In this design, every download takes place on a separate thread, throttled by a static Semaphore instance. The Semaphore limits are set in the app.config...

using System;
using System.Net;
using System.Threading;
using System.Threading.Tasks;

public class Downloader
{
    // static semaphore shared by all instances; it caps how many downloads
    // run at once. Both counts come from app.config via Settings.
    private static readonly Semaphore DownloadThrottle =
        new Semaphore(Properties.Settings.Default.ThrottleCount,
                      Properties.Settings.Default.ThrottleCount);

    // exposed so the caller can poll for completion and read the result
    public Task<bool> DownloadTaskThread { get; set; }

    private string _url;
    private string _localFileName;

    public void Get(string url, string localFileName)
    {
        _url = url;
        _localFileName = localFileName;
        DownloadTaskThread = new Task<bool>(Worker);
        DownloadTaskThread.Start();
    }

    private bool Worker()
    {
        try
        {
            DownloadThrottle.WaitOne();   // wait for a free download slot
            using (WebClient wc = new WebClient())
            {
                wc.DownloadFile(_url, _localFileName);
                return true;
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.Message);
        }
        finally
        {
            DownloadThrottle.Release();   // free the slot for the next task
        }
        return false;
    }
}

In this class, the 'Get' method is invoked and it starts a thread to perform the work. The client class keeps a list of the 'DownloadTaskThread' instances and polls the results. When the number of completed tasks reaches a tolerable level, more threads are started.
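A hypothetical caller might look like this (the polling loop and the 'jobs' collection are my illustration of that description, not the author's code):

using System.Collections.Generic;
using System.Linq;

var active = new List<Downloader>();
foreach (var job in jobs)                 // jobs: url/local-file pairs, however you store them
{
    var d = new Downloader();
    d.Get(job.Url, job.LocalFileName);
    active.Add(d);
}
// poll: start more work once enough of these have completed
int completed = active.Count(d => d.DownloadTaskThread.IsCompleted);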

The throttle in the Worker and the throttle in the main class need to be 'tuned' so that there's a balance between getting the work done and the overall responsiveness of the application.

Using a Semaphore assures that only a certain amount of work is performed at a time, keeping the UI responsive.

Using the Task class, you get access to how the operation completed, i.e., successfully or not. The Task instance is strongly typed, so you can use any sort of result that fits your application; in my example, it returns a bool. This is a great advantage over earlier versions of .NET, which offered only ThreadPool.QueueUserWorkItem (for which there is no return value). The generic Task<TResult> class is documented on MSDN.

Upvotes: 1

Eugen Rieck

Reputation: 65254

My approach would be:

  • Put your links into a queue
  • Start N threads, with N being configurable
  • In each thread, loop: dequeue a link; if that fails (queue empty), quit the thread; otherwise download the link (see the sketch after this list)
  • On your main thread, do some UI work - either in a GUI or on the console, write something like "Running, x/y files downloaded" every few seconds
  • On your main thread, once the queue is empty, Join() the other threads
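A rough sketch of the whole recipe (the thread count, reporting interval, and WebClient call are all illustrative):

using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Net;
using System.Threading;

var queue = new ConcurrentQueue<string>(urls);          // urls: your list of links
int total = urls.Count, done = 0;
int n = 4;                                              // N, made configurable in practice
var workers = Enumerable.Range(0, n).Select(_ => new Thread(() =>
{
    string url;
    while (queue.TryDequeue(out url))                   // empty queue => quit the thread
    {
        using (var wc = new WebClient())
            wc.DownloadString(url);                     // save the result as needed
        Interlocked.Increment(ref done);
    }
})).ToList();
workers.ForEach(t => t.Start());

while (workers.Any(t => t.IsAlive))                     // main thread: progress output
{
    Console.WriteLine("Running, {0}/{1} files downloaded", done, total);
    Thread.Sleep(2000);
}
workers.ForEach(t => t.Join());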

Upvotes: 0

Osman Turan

Reputation: 1371

Multithreading and multiple BackgroundWorker instances technically address the same thing. Basically, you can use a thread pool for such usages, or even better, use Parallel.ForEach. If you need maximum performance, make sure you minimize synchronization of data between threads and/or the GUI.
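An illustrative Parallel.ForEach version (the degree of parallelism is a guess you'd tune for your connection):

using System.Net;
using System.Threading.Tasks;

Parallel.ForEach(urls,
    new ParallelOptions { MaxDegreeOfParallelism = 8 },
    url =>
    {
        using (var wc = new WebClient())
        {
            byte[] data = wc.DownloadData(url);
            // persist 'data' here; avoid sharing state with other iterations or the GUI
        }
    });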

Upvotes: 0

DOK

Reputation: 32831

In general, multithreading makes use of multiple threads (possibly on multiple cores of the CPU) to process code in parallel while the application waits for it to return. An excellent tool for multithreading in .NET is the Parallel Extensions library (the Task Parallel Library), previewed against .NET 3.5 and shipped with .NET 4.0. One of the biggest performance improvements from multithreading comes from applying it to repetitious processes inside loops.

Background workers are different. They may or may not be multithreaded. The point here is that the application code does not pause and wait for the background process to complete. Rather, the main code can continue processing. There is a separate method called when the background process returns its result. Here is an explanation from MSDN:

The BackgroundWorker class allows you to run an operation on a separate, dedicated thread. Time-consuming operations like downloads and database transactions can cause your user interface (UI) to seem as though it has stopped responding while they are running. When you want a responsive UI and you are faced with long delays associated with such operations, the BackgroundWorker class provides a convenient solution.
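A minimal BackgroundWorker sketch of that pattern (the URL and handler bodies are illustrative):

using System.ComponentModel;
using System.Net;

var worker = new BackgroundWorker();
worker.DoWork += (s, e) =>
{
    using (var wc = new WebClient())
        e.Result = wc.DownloadString((string)e.Argument);   // runs off the UI thread
};
worker.RunWorkerCompleted += (s, e) =>
{
    // raised back on the UI thread; e.Result holds the page, e.Error any exception
};
worker.RunWorkerAsync("http://example.com/");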

Upvotes: 0

Tudor

Reputation: 62439

If you have a GUI, then the recommended procedure is to use the BackgroundWorker.

Upvotes: 0
