Reputation: 5259
I have a tree data structure where each of the siblings can be processed in parallel with one another.
Currently I create a separate task to process each child (see the sketch below). Is this naive, and would it hurt performance once the tree reaches a certain size/depth?
Or is the CLR designed to handle an arbitrary load of tasks, since they are not bound to particular OS threads?
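Roughly, what I do now looks like this (a sketch; TreeNode, Children and ProcessNode are stand-ins for my actual types and work):

// One task per child, recursively. TreeNode, Children and
// ProcessNode are stand-ins for the actual types and work.
async Task ProcessTreeAsync(TreeNode node)
{
    ProcessNode(node); // the per-node work
    var childTasks = node.Children
        .Select(child => Task.Run(() => ProcessTreeAsync(child)));
    await Task.WhenAll(childTasks);
}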
Upvotes: 0
Views: 88
Reputation: 131364
When asking a question, the "why" can be as important as the "what" and "how". From the (now lost) comments:
Well it’s an AST so if we have a bunch of files, they could have 100s of subtrees each. The actual processing isn’t complex for each node but the volume could be in the 100s of concurrent tasks.
This should be part of the question, along with a description of the processing required.
Asking about child tasks is like asking what action to embed in a parser rule: why embed actions in parser rules at all? That's just one technique. Whether it's suitable or not depends on what you are doing. A visitor that works on the AST may be better, or a listener that reacts to specific rules. ANTLR, for example, offers all three options. Which one you pick depends on the job.
Just like parsing, the kind of job matters in parallel computing too.
When you have a lot of data, creating more tasks than cores simply wastes time. It's better to create fewer tasks and split the data among them. This way, a single task can process all of its assigned data without thread switching and the delays that involves.
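Hand-rolled, that idea looks roughly like this (a sketch; ProcessNode is a hypothetical stand-in for the per-item work, and Partitioner lives in System.Collections.Concurrent):

// One task per partition instead of one task per item.
var partitions = Partitioner.Create(items).GetPartitions(Environment.ProcessorCount);
var tasks = partitions.Select(partition => Task.Run(() =>
{
    using (partition) // each partition is an IEnumerator<T>
        while (partition.MoveNext())
            ProcessNode(partition.Current);
}));
await Task.WhenAll(tasks);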
This is called data parallelism. TPL supports it through the Parallel class and Parallel LINQ. If you have an IEnumerable with the data that needs processing, you can process the items in parallel with, e.g.:
Parallel.ForEach(myCollection, singleItem => SomeProcessing(singleItem));
Parallel.ForEach will create (roughly) as many tasks as cores, partition the data and send each part to a core.
Parallel LINQ allows you to execute each operator of a LINQ query in parallel, simply by adding an .AsParallel() call, e.g.:
var results = from file in listOfFiles.AsParallel()
              from stock in ParseFileToGetStocks(file)
              where stock.Price > 100
              group stock by stock.Category into g
              select new { Category = g.Key, Max = g.Max(s => s.Price) /* ... */ };
The parsing, filtering, grouping, aggregation parts will run in separate, parallel steps.
As long as you can create an IEnumerable from your tree, e.g. with an iterator, you can use it with Parallel.For/ForEach or PLINQ.
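For example, a simple iterator that flattens the tree (a sketch; TreeNode and its Children collection are assumptions about the actual AST types):

// Flatten the tree into an IEnumerable using an explicit stack.
IEnumerable<TreeNode> Flatten(TreeNode root)
{
    var stack = new Stack<TreeNode>();
    stack.Push(root);
    while (stack.Count > 0)
    {
        var node = stack.Pop();
        yield return node;
        foreach (var child in node.Children)
            stack.Push(child);
    }
}

Parallel.ForEach(Flatten(root), node => ProcessNode(node));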
That's just one option though, and it may not be suitable for this problem. After all, part of the job is reading a lot of files, an IO operation. Why not read and process the files in separate steps, just like, e.g., a PowerShell or bash pipeline of individual commands?
That's a dataflow problem, supported by the TPL Dataflow library. You can break the job into separate blocks, each of which runs on its own task(s). One block could load and parse a file, a second one could request some data from an external service, and a final one could process the results.
Let's say the files only contain some stock data and you have to request more from an external service. The first block could parse the files and extract the stocks, the second could request the additional data, and a final step could process all the data. Creating a dataflow would allow all jobs to run in parallel. A network-bound download could happen at the same time a CPU-heavy processing step worked on the results of another file, e.g.:
var parserBlock = new TransformManyBlock<string, StockData>(file =>
{
    // a lambda can't use yield, so return an IEnumerable instead
    var stocks = Parse(file);
    return stocks.Select(stock => new StockData(...));
});
var downloader = new TransformBlock<StockData, CompleteData>(stock =>
{
    var extraData = someService.Get(stock.Symbol, stock.Date, ...);
    return new CompleteData(stock, extraData);
});
var calculateBlock = new ActionBlock<CompleteData>(stock =>
{
    var results = HeavyProcessing(stock);
    WriteResults(stock, results);
});
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
parserBlock.LinkTo(downloader, linkOptions);
downloader.LinkTo(calculateBlock, linkOptions);
Once you have a pipeline of blocks you can start posting data to it:
foreach (var node in tree)
{
    parserBlock.Post(node.File);
}
When you are done, you tell the first block to Complete() and await until all blocks up to the final one have finished:
parserBlock.Complete();
await calculateBlock.Completion;
Since the downloader will just wait for responses from the service, you can specify that, e.g., up to 5 downloads will run at the same time:
var parallelOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 5 };
var downloader = new TransformBlock<StockData, CompleteData>(..., parallelOptions);
Upvotes: 1
Reputation: 10929
The short answer to this question is yes. But in order to understand it completely, you will have to look deeper into tasks and see whether your tasks are going to be long-running or short-running.
The second link contains the answer to your question, but to sum things up for you:
Tasks are short-running by default. What that means:
The task will be created on the thread pool. If your tasks are going to do work that takes a lot of time, i.e. long-running work, you should avoid creating them on the thread pool, as they will tie up more and more pool threads and eventually starve the thread pool.
Long-running means the task will be created on a dedicated thread outside of the thread pool. You can request this using the TaskCreationOptions enumeration:
TaskCreationOptions.LongRunning
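E.g., a sketch (LongWork is a stand-in for the actual long-running work):

// LongRunning hints the scheduler to run this task on a dedicated
// thread instead of a thread-pool thread. LongWork is a stand-in.
var task = Task.Factory.StartNew(() => LongWork(),
    TaskCreationOptions.LongRunning);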
Upvotes: 1