Reputation: 396
I tried to find something online regarding this, but there doesn't seem to be a definitive answer. I just have my own reasoning and would like to know what the best way is.
My application runs through a long list of files (about 100-200) and does some calculations on the data inside them. Each file takes a few minutes to process.
I originally planned on creating Tasks based on the number of cores in the processor.
So if there are 4 cores, then I would create 3 tasks and have each task process 1/3 of the files.
My reading has told me that the thread pool manages all tasks and accordingly creates threads for them based on a variety of factors (in simple terms?).
Would it then be better for me to simply create a task for each file and allow the thread pool to decide what is best?
Any info or suggestions would be very welcome! Thanks
EDIT: All the files are about 5MB, and the calculations/analysis of the data in the files is very processor heavy.
Upvotes: 3
Views: 107
Reputation: 116548
200 files isn't such a long list, but I would still recommend against flooding the ThreadPool with pending tasks.
You can use TPL Dataflow's ActionBlock for this. You create the block, give it an action to perform on each item and limit the parallelism to whatever you want.
Example in C#:
    var block = new ActionBlock<string>(async fileName =>
    {
        var data = await ReadFileAsync(fileName);
        ProcessData(data);
    }, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 50 });
    foreach (var fileName in fileNames)
    {
        block.Post(fileName);
    }
    block.Complete();
    await block.Completion;
Since it's not just a CPU-bound operation, you should use a higher number than the number of available CPUs. Consider putting the value in a config file so you can change it according to actual performance.
Upvotes: 2
Reputation: 171178
"based on a variety of factors"
That is a key point. It is unpredictable (to me) how many threads will actually be running for non-CPU-bound work under full load. The .NET thread pool heuristics are very volatile (subjectively: insane) and should not be relied upon.
"allow the thread pool to decide what is best"
It can't know. It is (mostly) good at scheduling CPU-bound work, but it can't find the optimal degree of parallelism for IO-bound work.
Use PLINQ:
    myFiles
        .AsParallel().WithDegreeOfParallelism(optimalDopHere)
        .ForAll(x => Process(x));
Determine the optimal degree of parallelism empirically.
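One way to do that measurement, as a rough sketch: time the same sample of work at a few candidate degrees of parallelism and compare. Everything here is an assumption for illustration: `Process` is a hypothetical stand-in for the real per-file work, the sample file names are made up, and the candidate values 1, 2, 4 and 8 are arbitrary. Real numbers will depend on the machine and on how much of the work is IO.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

// Hypothetical stand-in for the real per-file work (pure CPU busy loop).
long Process(string fileName)
{
    long sum = 0;
    for (int i = 0; i < 200_000; i++)
    {
        sum += (i ^ fileName.Length) % 7;
    }
    return sum;
}

// Runs the whole sample once at the given degree of parallelism
// and returns the wall-clock time it took.
TimeSpan TimeRun(string[] files, int dop)
{
    var stopwatch = Stopwatch.StartNew();
    files.AsParallel()
         .WithDegreeOfParallelism(dop)
         .ForAll(f => Process(f));
    stopwatch.Stop();
    return stopwatch.Elapsed;
}

// Made-up sample; in practice use a representative subset of the real files.
var sample = Enumerable.Range(0, 32).Select(i => $"file{i}.dat").ToArray();

foreach (var dop in new[] { 1, 2, 4, 8 })
{
    Console.WriteLine($"DOP {dop}: {TimeRun(sample, dop).TotalMilliseconds:F0} ms");
}
```

Run it a few times, since the first iteration pays for JIT compilation and thread pool warm-up, and pick the value where the time stops improving.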
If this is purely CPU-bound work you can get away with pretty much any parallel construct, probably the Parallel class, or still PLINQ.
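For the CPU-bound case, a minimal sketch with `Parallel.ForEach` could look like the following. It hands over one work item per file and caps the parallelism, rather than splitting the list into fixed thirds by hand. `Analyze`, `ProcessAll` and the file names are hypothetical placeholders, and using `Environment.ProcessorCount` as the cap is only a starting assumption to be tuned.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical stand-in for the real per-file analysis.
int Analyze(string fileName) => fileName.Length;

// Processes every file with at most maxParallelism items running at once
// and collects one result per file name.
IDictionary<string, int> ProcessAll(IEnumerable<string> fileNames, int maxParallelism)
{
    var results = new ConcurrentDictionary<string, int>();
    var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallelism };

    Parallel.ForEach(fileNames, options, fileName =>
    {
        // ConcurrentDictionary makes this write safe across threads.
        results[fileName] = Analyze(fileName);
    });

    return results;
}

// Made-up file names for illustration.
var processed = ProcessAll(new[] { "a.dat", "bb.dat", "ccc.dat" }, Environment.ProcessorCount);
Console.WriteLine($"Processed {processed.Count} files");
```

Because each file is its own work item, a file that takes longer than the others does not leave the remaining workers idle the way a fixed up-front split can.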
Upvotes: 2