Reputation: 2668
I am working on a new service to run QA for our company's multiple web properties, and have run into an interesting network concurrency issue. To increase performance, I am using the TPL to create HttpWebRequests from a large collection of URLs so that they can run in parallel; however, I can't seem to find where the bottleneck is in the process.
My observations so far:
Possible pain points:
So the question is:
Obviously there is no way to download the entire internet in a matter of minutes, but I am interested to know where the bottleneck is in a scenario like this and what, if anything, can be done to overcome it.
As a side note, we are currently using a 3rd party service for crawling, but we are limited by them in some ways and would like more flexibility. Something about corporate secret sauce or poison on the tip of the arrow ... :)
Upvotes: 5
Views: 2063
Reputation: 13976
The code is really very simple. I use Parallel.ForEach to loop through a collection of URLs (strings). The action creates an HttpWebRequest and then dumps the results into a ConcurrentBag. BTW, NCrawler seems interesting; I'll check it out. Thanks for the tip.
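For readers following along, here is a minimal sketch of the kind of loop described above. It is not the asker's actual code: the class and method names (CrawlSketch, FetchAll) and the choice to record the body length (or -1 on failure) are assumptions made for illustration.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Threading.Tasks;

static class CrawlSketch
{
    // Hypothetical helper: fetches each URL in parallel and stores
    // (url, body length in characters); a failed request is stored as -1.
    public static ConcurrentBag<KeyValuePair<string, int>> FetchAll(IEnumerable<string> urls)
    {
        var results = new ConcurrentBag<KeyValuePair<string, int>>();

        Parallel.ForEach(urls, url =>
        {
            try
            {
                // Assumes http/https URLs; other schemes would not yield an HttpWebRequest.
                var request = (HttpWebRequest)WebRequest.Create(url);

                // Dispose the response and reader so the underlying connection is released.
                using (var response = (HttpWebResponse)request.GetResponse())
                using (var reader = new StreamReader(response.GetResponseStream()))
                {
                    string body = reader.ReadToEnd();
                    results.Add(new KeyValuePair<string, int>(url, body.Length));
                }
            }
            catch (WebException)
            {
                results.Add(new KeyValuePair<string, int>(url, -1));
            }
        });

        return results;
    }
}
```

Because GetResponse blocks, each in-flight request ties up one thread here, which is why the number of threads the TPL decides to use matters so much for throughput.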
Because Parallel.ForEach gives you no direct control over the number of threads it uses, I suggest at least switching to the ThreadPool. You can use QueueUserWorkItem to allocate work until your task collection has been completely pushed to worker threads or until the method returns false (no more threads in pool). With ThreadPool you can control the maximum number of threads to be allocated with SetMaxThreads.
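A rough sketch of the ThreadPool approach described above, under some assumptions: PoolSketch, RunAll, and the fetch delegate are hypothetical names, and fetch stands in for whatever does the actual HTTP work. ThreadPool.SetMaxThreads, ThreadPool.GetMaxThreads, ThreadPool.QueueUserWorkItem, and CountdownEvent are the real framework APIs.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

static class PoolSketch
{
    // Queues one work item per URL and blocks until all of them have finished.
    // Returns whether the requested thread cap was accepted by the pool.
    public static bool RunAll(IList<string> urls, Action<string> fetch, int maxWorkers)
    {
        int currentWorkers, currentIo;
        ThreadPool.GetMaxThreads(out currentWorkers, out currentIo);

        // The cap is process-wide. SetMaxThreads returns false and changes nothing
        // if the value is rejected (for example, lower than the processor count).
        bool capApplied = ThreadPool.SetMaxThreads(maxWorkers, currentIo);

        using (var pending = new CountdownEvent(urls.Count))
        {
            foreach (string url in urls)
            {
                string current = url; // private copy for the closure
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    // fetch is expected to handle its own exceptions; an unhandled
                    // exception on a pool thread would take down the process.
                    try { fetch(current); }
                    finally { pending.Signal(); }
                });
            }

            pending.Wait(); // returns once every queued item has signalled
        }

        return capApplied;
    }
}
```

Items that cannot run immediately simply wait in the pool's queue until a thread frees up, so the cap limits how many requests are in flight rather than how many can be queued.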
Upvotes: 1
Reputation: 10432
Maybe you're hitting a TCP connection limit, or not disposing of connections properly. In any case, try using something like JMeter to see the maximum concurrent HTTP throughput you can get.
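One connection limit worth ruling out, assuming the code runs on the .NET Framework with HttpWebRequest: ServicePointManager caps concurrent connections per host, and the default for ordinary client applications is 2. Whether this is the limit actually being hit here is a guess; the sketch below only shows how that setting is raised. ConnectionLimitSketch and RaiseLimit are hypothetical names.

```csharp
using System.Net;

static class ConnectionLimitSketch
{
    // Raises the per-host connection cap used by HttpWebRequest.
    public static void RaiseLimit(int perHostLimit)
    {
        // Only affects ServicePoint objects created afterwards,
        // so call this before the first request is issued.
        ServicePointManager.DefaultConnectionLimit = perHostLimit;
    }
}
```

Calling something like ConnectionLimitSketch.RaiseLimit(100) once at startup is enough to test the theory. The value 100 is arbitrary and would need tuning against what the target servers tolerate.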
Upvotes: 1
Reputation: 171178
I strongly suspect one of the following is the cause:
The TPL does not guarantee you any minimum degree of parallelism (DOP). That is a pity because sometimes you really need to control the degree of parallelism exactly when working with IO.
I recommend you manually start a fixed number of threads to do your IO because that is the only way to guarantee a specific DOP. You need to experiment with the exact value. It could be in the range of 50 to 500. You can reduce the default stack size of your threads to save memory with that many threads.
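A minimal sketch of the fixed-thread approach suggested above, assuming a shared queue of URLs and a caller-supplied fetch delegate. FixedDopSketch, RunAll, and fetch are hypothetical names, and the stack size is a value to experiment with rather than a recommendation.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

static class FixedDopSketch
{
    // Starts exactly threadCount threads that drain a shared queue of URLs,
    // then blocks until every thread has finished.
    public static void RunAll(IEnumerable<string> urls, Action<string> fetch,
                              int threadCount, int stackSizeBytes)
    {
        var queue = new ConcurrentQueue<string>(urls);
        var threads = new List<Thread>();

        for (int i = 0; i < threadCount; i++)
        {
            // The second constructor argument is the maximum stack size in bytes;
            // a smaller stack keeps memory use down when running hundreds of threads.
            var worker = new Thread(() =>
            {
                string url;
                while (queue.TryDequeue(out url))
                {
                    // fetch is expected to handle its own exceptions; an unhandled
                    // exception on this thread would take down the process.
                    fetch(url);
                }
            }, stackSizeBytes);

            worker.IsBackground = true;
            worker.Start();
            threads.Add(worker);
        }

        foreach (var thread in threads)
        {
            thread.Join();
        }
    }
}
```

For example, FixedDopSketch.RunAll(urls, Fetch, 100, 256 * 1024) would run 100 threads with 256 KB stacks, where Fetch is your own method. Both numbers are starting points to tune within the 50 to 500 range mentioned above.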
Upvotes: 7