Reputation: 5029
I need to scrape data from a website. I have over 1,000 links I need to access, and previously I was dividing the links 10 per thread, and would start 100 threads each pulling 10. After few test cases, 100 threads was the best count to minimize the time it retrieved the content for all the links.
I realized that .NET 4.0 offered better support for multi-threading out of the box, but this is done based on how many cores you have, which in my case does not spawn enough threads. I guess what I am asking is: what is the best way to optimize the 1,000 link pulling. Should I be using .ForEach
and let the Parallel
extension control the amount threads that get spawned, or find a way to tell it how many threads to start and divide the work?
I have not worked with Parallel
before so maybe my approach maybe wrong.
Upvotes: 5
Views: 17740
Reputation: 1761
you can use MaxDegreeOfParallelism property in Parallel.ForEach to control the number of threads that will be spawned.
Heres the code snippet -
ParallelOptions opt = new ParallelOptions();
opt.MaxDegreeOfParallelism = 5;
Parallel.ForEach(Directory.GetDirectories(Constants.RootFolder), opt, MyMethod);
Upvotes: 6
Reputation: 71956
Something worth checking out is the TPL Dataflow library.
DataFlow on MSDN.
See Nesting await in Parallel.ForEach
The whole idea behind Parallel.ForEach() is that you have a set of threads and each processes part of the collection. As you noticed, this doesn't work with async-await, where you want to release the thread for the duration of the async call.
Also, the walkthrough Creating a Dataflow Pipeline specifically sets up and processes multiple web page downloads. TPL Dataflow really was designed for that scenario.
Upvotes: 3
Reputation: 317
Hard to say without looking at your code and how the collection is defined, I've found that Parallel.Invoke
is the most flexible. try msdn? ... sounds like you are looking to use Parallel.For Method (Int32, Int32, Action<Int32, ParallelLoopState>)
Upvotes: 0
Reputation: 150238
In general, Parallel.ForEach()
is quite good at optimizing the number of threads. It accounts for the number of cores in the system, but also takes into account what the threads are doing (CPU bound, IO bound, how long the method runs, etc.).
You can control the maximum degree of parallelization, but there's no mechanism to force more threads to be used.
Make sure your benchmarks are correct and can be compared in a fair manner (e.g. same websites, allow for a warm-up period before you start measuring, and do many runs since response time variance can be quite high scraping websites). If after careful measurement your own threading code is still faster, you can conclude that you have optimized for your particular case better than .NET and stick with your own code.
Upvotes: 4