bbrez1

C# multithreaded downloading of webpages using proxies - performance issues

I'm making a multithreaded proxy checker. I have my own multithreading algorithm which basically starts up a bunch of threads (50+), with each thread connecting to a webpage and simply downloading and checking the response. If the response contains a certain string, I assume the proxy is working. The problem occurs when each of the 50 threads tries to download the webpage at the same time. The webpage itself is 400 kB in size, so it can take some time to download through a proxy. When I do this without proxies I get 50 results in almost no time, but with proxies I get results in "batches" of 2 or 3 and it's far too slow. I'm using a simple WebClient object with a small timeout. I have a 100 Mbit connection and it's only using about 10% of that. I tried to find a few solutions online with no luck. The multithreading part of the code works without any problems because I have used it in numerous projects before and it is well polished.

With some playing around I found out that all of the 50 threads eventually reach the same line of code (at the exact same time), the one that downloads the source of the page, but then they stall there:

    result = webClient.DownloadString(url);

I added a simple before-and-after timer around this line to test how long the download takes. One would assume it would not take any more than 5 seconds (since that is the timeout), but the measured times are huge and just keep piling up (up to 120 seconds). So I guess there is a limit somewhere on how many connections can be active at once. Since I have 50 threads running at the same time, I also want to be downloading 50 pages at once, not waiting for the previous ones to finish.

I have tried using:

    System.Net.ServicePointManager.DefaultConnectionLimit = int.MaxValue;

with no luck, however. This is my code:

    public class AwesomeWebClient : WebClient
    {
        protected override WebRequest GetWebRequest(Uri address)
        {
            // Apply the 5 second timeout to every request this client makes.
            WebRequest request = base.GetWebRequest(address);
            request.Timeout = 5000;
            return request;
        }
    }

    private static string Get(string url, string proxy, string UA)
    {
        string result = "";

        try
        {
            var webClient = new AwesomeWebClient();
            webClient.Headers.Add("Referer", "http://yahoo.com");
            webClient.Headers.Add("X-Requested-With", "XMLHttpRequest");
            webClient.Headers.Add("Accept", "*");
            webClient.Headers.Add("User-Agent", UA);
            webClient.Proxy = new WebProxy(proxy);
            result = webClient.DownloadString(url);
        }
        catch (Exception x)
        {
            //Console.WriteLine(x.Message + " | " + url);
        }

        return result;
    }


Answers (1)

Jim Mischel

There's a lot of stuff going on behind the scenes with WebClient, any one of which could be the bottleneck. After all, WebClient is just a convenient wrapper around HttpWebRequest. One thing that can cause problems here is DNS resolution, which can limit the number of concurrent requests you can make, although I can't see it causing the kind of slowdown you describe.

But quite likely the problem is the threading. In your single-threaded model, you have one thread that gets one document at a time. That it can do very quickly. With 50 threads, you have the overhead of thread context switches. So one thread gets a few tens of kilobytes, but then it gets swapped out for the next thread. That context switch overhead is going to slow things down.

You should consider reducing the number of threads. What happens if you do this with two threads? How about four threads? If you limit the amount of thread context switching, you're going to speed up your program.
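
For illustration, capping the parallelism could look something like this (a sketch, not your exact code), using Parallel.ForEach with MaxDegreeOfParallelism from System.Threading.Tasks. Here proxies and url stand in for your own data, and Get is the method from your question:

    // Illustrative sketch only: a capped degree of parallelism instead of
    // 50 free-running threads. Experiment with the number.
    var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };

    Parallel.ForEach(proxies, options, proxy =>
    {
        string page = Get(url, proxy, "Mozilla/5.0");
        bool works = page.Contains("certain string"); // same check you do now
    });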

The other thing you could try is DownloadStringAsync, although even then you should limit the number of concurrent requests.
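
If you go that route, a rough sketch might look like the following, assuming the Task-based DownloadStringTaskAsync from .NET 4.5 and a SemaphoreSlim to cap the number of in-flight requests. GetAsync is just a hypothetical async counterpart to your Get method:

    // Sketch only, assuming .NET 4.5+ (System.Threading, System.Threading.Tasks).
    // The semaphore caps concurrent requests at 4; tune that number.
    private static readonly SemaphoreSlim throttle = new SemaphoreSlim(4);

    private static async Task<string> GetAsync(string url, string proxy, string UA)
    {
        await throttle.WaitAsync();
        try
        {
            using (var webClient = new AwesomeWebClient())
            {
                webClient.Headers.Add("User-Agent", UA);
                webClient.Proxy = new WebProxy(proxy);
                return await webClient.DownloadStringTaskAsync(url);
            }
        }
        catch (Exception)
        {
            return ""; // same "swallow and return empty" behavior as Get()
        }
        finally
        {
            throttle.Release();
        }
    }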

Finally, and I don't know if this is still true, but in the past it was much faster to create the WebClient once and use it for multiple files than it was to create a new WebClient for each download. That is, this code:

    WebClient myClient = new WebClient();
    foreach (var url in urlsList)
    {
        myClient.DownloadString(url);
    }

was significantly faster than this code:

    foreach (var url in urlsList)
    {
        WebClient myClient = new WebClient();
        myClient.DownloadString(url);
    }

I never tracked down the reason why, and I've seen some people say that it's no longer the case. But I haven't tested it myself recently.
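
Applied to your proxy checker, reusing one client per worker and only swapping the proxy between requests would look roughly like this. Again, just a sketch: proxiesForThisWorker and url are placeholders for your own data:

    // Rough sketch: one WebClient reused for a batch of proxies instead of a
    // new client per request.
    var myClient = new AwesomeWebClient();

    foreach (var proxy in proxiesForThisWorker)
    {
        myClient.Headers["User-Agent"] = "Mozilla/5.0"; // re-set in case it gets cleared
        myClient.Proxy = new WebProxy(proxy);
        try
        {
            string page = myClient.DownloadString(url);
            // check "page" for the expected string here
        }
        catch (Exception)
        {
            // treat as a dead or slow proxy
        }
    }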
