Steve Konves

Reputation: 2668

Concurrent web request performance issues

I am working on a new service to run QA for our company's multiple web properties, and have run into an interesting network concurrency issue. To increase performance, I am using the TPL to create HttpWebRequests from a large collection of URLs so that they can run in parallel; however, I can't seem to find where the bottleneck is in the process.

My observations so far:

Possible pain points:

So the question is:

Obviously there is no way to download the entire internet in a matter of minutes, but I am interested to know where the bottleneck is in a scenario like this and what, if anything, can be done to overcome it.

As a side note, we are currently using a 3rd party service for crawling, but we are limited by them in some ways and would like more flexibility. Something about corporate secret sauce or poison on the tip of the arrow ... :)

Upvotes: 5

Views: 2063

Answers (3)

Marcel N.

Reputation: 13976

The code is really very simple. I use Parallel.ForEach to loop through a collection of URLs (strings). The action creates an HttpWebRequest and then dumps the results into a ConcurrentBag. BTW, NCrawler seems interesting; I'll check it out. Thanks for the tip.

Because Parallel.ForEach gives you no direct control over the number of threads it uses (ParallelOptions.MaxDegreeOfParallelism only sets an upper bound), I suggest at least switching to the ThreadPool.

You can use QueueUserWorkItem to queue work until your whole task collection has been handed to the pool. Items that cannot run immediately wait in the pool's queue until a worker thread is free.

With ThreadPool you can control the maximum number of threads to be allocated with SetMaxThreads.
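The ThreadPool approach above can be sketched roughly as follows. This is a minimal sketch, not the asker's actual code: the PoolCrawler class, the maxWorkers parameter, and the fetch callback (which stands in for the HttpWebRequest logic) are all made up for illustration. Only ThreadPool.GetMaxThreads, ThreadPool.SetMaxThreads, ThreadPool.QueueUserWorkItem, and CountdownEvent are real framework APIs.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public static class PoolCrawler
{
    // Queues one work item per URL and blocks until all of them have finished.
    // `fetch` is a hypothetical callback standing in for the HttpWebRequest code.
    public static ConcurrentBag<string> Run(
        IReadOnlyCollection<string> urls, int maxWorkers, Func<string, string> fetch)
    {
        // Cap worker threads process-wide, keeping the current I/O thread cap.
        // SetMaxThreads returns false (and changes nothing) if the value is
        // below the pool minimum or the processor count.
        ThreadPool.GetMaxThreads(out _, out int ioThreads);
        ThreadPool.SetMaxThreads(maxWorkers, ioThreads);

        var results = new ConcurrentBag<string>();
        if (urls.Count == 0) return results;

        using (var done = new CountdownEvent(urls.Count))
        {
            foreach (var url in urls)
            {
                ThreadPool.QueueUserWorkItem(state =>
                {
                    try { results.Add(fetch((string)state)); }
                    catch (Exception ex) { results.Add("error: " + ex.Message); }
                    finally { done.Signal(); } // always count the item as done
                }, url);
            }
            done.Wait(); // block until every queued item has signalled
        }
        return results;
    }
}
```

Note that SetMaxThreads is only an upper bound for the whole process; the pool still decides how quickly it adds threads up to that cap.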

Upvotes: 1

Ilya Kozhevnikov

Reputation: 10432

Maybe you're hitting the TCP connection limit, or not disposing of connections properly. In any case, try using something like JMeter to see the maximum concurrent HTTP throughput you can get.
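On the disposal point: with HttpWebRequest, a response that is never closed keeps holding its connection, so later requests to the same host can stall waiting for a free one. A minimal sketch of a fetch that releases everything deterministically is shown below. The Fetcher class, the FetchLength method, and the 15-second timeout are assumptions for illustration, not something taken from the question.

```csharp
using System;
using System.IO;
using System.Net;

public static class Fetcher
{
    // Downloads the body of `url` and returns its length in characters.
    // The nested using blocks close the response and its stream even when
    // reading fails, which returns the connection to the pool.
    public static int FetchLength(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Timeout = 15000; // milliseconds; illustrative value

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream))
        {
            return reader.ReadToEnd().Length;
        }
    }
}
```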

Upvotes: 1

usr

Reputation: 171178

I strongly suspect one of the following is the cause:

  1. You are running into the default connection limit. Check the value of ServicePointManager.DefaultConnectionLimit. I recommend you set it to a practically infinite value such as 1000.
  2. The TPL is not starting as many threads as are necessary to saturate the network. Note that remote web servers can have a large amount of latency. While waiting, your thread is not putting load on the network.
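For the first point, a minimal sketch of checking and raising the limit. ServicePointManager.DefaultConnectionLimit is the real property; the ConnectionSetup class and RaiseConnectionLimit helper are made up for illustration. On .NET Framework the default for a client application is 2 connections per host, and a new value only affects ServicePoint objects created after it is set, so this should run at startup before the first request.

```csharp
using System.Net;

public static class ConnectionSetup
{
    // Sets the per-host connection limit for ServicePoints created from now on
    // and returns the previous value so the caller can log or restore it.
    public static int RaiseConnectionLimit(int limit)
    {
        int previous = ServicePointManager.DefaultConnectionLimit;
        ServicePointManager.DefaultConnectionLimit = limit;
        return previous;
    }
}
```

A call such as ConnectionSetup.RaiseConnectionLimit(1000) at the top of Main would apply the value suggested above.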

The TPL does not guarantee you any minimum degree of parallelism (DOP). That is a pity because sometimes you really need to control the degree of parallelism exactly when working with IO.

I recommend you manually start a fixed number of threads to do your IO because that is the only way to guarantee a specific DOP. You need to experiment with the exact value. It could be in the range of 50 to 500. You can reduce the default stack size of your threads to save memory with that many threads.
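A fixed degree of parallelism with dedicated threads might look roughly like this. It is a sketch under stated assumptions: the FixedDopRunner class, the fetch callback (standing in for the HttpWebRequest code), and the 256 KB stack size are illustrative choices, not values from the question. The Thread(ThreadStart, int maxStackSize) constructor is the real API for requesting a smaller stack.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public static class FixedDopRunner
{
    // Starts exactly `degreeOfParallelism` threads that drain a shared queue.
    // `fetch` is a hypothetical callback that performs the blocking IO.
    public static ConcurrentBag<TResult> Run<TResult>(
        IEnumerable<string> urls, int degreeOfParallelism,
        Func<string, TResult> fetch, int stackSizeBytes = 256 * 1024)
    {
        var queue = new ConcurrentQueue<string>(urls);
        var results = new ConcurrentBag<TResult>();
        var threads = new List<Thread>();

        for (int i = 0; i < degreeOfParallelism; i++)
        {
            var worker = new Thread(() =>
            {
                // Each worker keeps pulling URLs until the queue is empty.
                while (queue.TryDequeue(out var url))
                {
                    try { results.Add(fetch(url)); }
                    catch (Exception) { /* one failed URL must not kill the worker */ }
                }
            }, stackSizeBytes); // smaller stack to save memory with many threads
            worker.IsBackground = true;
            worker.Start();
            threads.Add(worker);
        }

        foreach (var worker in threads) worker.Join(); // wait for all workers
        return results;
    }
}
```

Because the thread count is fixed by the caller, it can be tuned experimentally within the suggested range of 50 to 500 without depending on how the TPL scheduler behaves.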

Upvotes: 7
