Bobson
Bobson

Reputation: 13696

Parallelizing data processing

I'm trying to improve the runtime of some data processing I'm doing. The data starts out as various collections (Dictionary mostly, but a few other IEnumerable types), and the end result of processing should be a Dictionary<DataType, List<DataPoint>>.

I have all this working just fine... except it takes close to an hour to run, and it needs to run in under 20 minutes. None of the data has any connection to any other from the same collection, although they cross-reference other collections frequently, so I figured I should parallelize it.

The main structure of the processing has two levels of loops with some processing in between:

// Custom class, 0.01%
var primaryData= GETPRIMARY().ToDictionary(x => x.ID, x => x);

// Custom class, 11.30%
var data1 = GETDATAONE().GroupBy(x => x.Category)
                        .ToDictionary(x => x.Key, x => x);  

// DataRows, 8.19%
var data2 = GETDATATWO().GroupBy(x => x.Type)
                        .ToDictionary(x => x.Key, x => x.OrderBy(y => y.ID));

foreach (var key in listOfKeys)
{
   // 0.01%
   var subData1 = data1[key].ToDictionary(x => x.ID, x => x);

   // 1.99%
   var subData2 = data2.GroupBy(x => x.ID)
                       .Where(x => primaryData.ContainsKey(x.Type))
                       .ToDictionary(x => x.Key, x => ProcessDataTwo(x, primaryData[x.Key]));

   // 0.70%
   var grouped = primaryData.Select(x => new { ID = x.Key, 
                                               Data1 = subData1[x.Key],
                                               Data2 = subData2[x.Key] }).ToList();
   foreach (var item in grouped)
   {
       // 62.12%
       item.Data1.Results = new Results(item.ID, item.Data2);
       // 12.37%
       item.Data1.Status = new Status(item.ID, item.Data2);
   }
   results.Add(key, grouped);
}
return results;

listOfKeys is very small, but each grouped will have several thousand items. How can I structure this so that each call to item.Data1.Process(item.Data2) can get queued up and executed in parallel?

According to my profiler, all the ToDictionary() calls together take up about 21% of the time, the ToList() takes up 0.7%, and the two items inside the inner foreach together take up 74%. Hence why I'm focusing my optimization there.

I don't know if I should use Parallel.ForEach() to replace the outer foreach, the inner one, both, or if there's some other structure I should use. I'm also not sure if there's anything I can do to the data (or the structures holding it) to improve parallel access to it.

(Note that I'm stuck on .NET4, so don't have access to async or await)

Upvotes: 2

Views: 378

Answers (2)

Scott Chamberlain
Scott Chamberlain

Reputation: 127543

Based on the percentages you posted and you said that grouped was very large you would definitely benefit by only paralyzing the inner loop.

Doing it is fairly simple to do

   var grouped = primaryData.Select(x => new { ID = x.Key, 
                                               Data1 = subData1[x.Key],
                                               Data2 = subData2[x.Key] }).ToList();
   Parallel.ForEach(grouped, (item) => 
   {
       item.Data1.Results = new Results(item.ID, item.Data2);
       item.Data1.Status = new Status(item.ID, item.Data2);
   });

   results.Add(key, grouped);

This assumes that new Results(item.ID, item.Data2); and new Status(item.ID, item.Data2); are safe to do multiple initializations at once (the only concern I would have is if they access non-thread safe static resources internally, and even so a non thread safe constructor is a really bad design flaw)


There is one big cavat: This will only help if you are CPU bound. If Results or Status is IO bound (for example it is waiting on a database call or a file on the hard drive) doing this will hurt your performance instead of helping it. If you are IO bound instead of CPU bound the only options are to buy faster hardware, attempt optimize those two methods more, or use caching in memory if possible so you don't need to do the slow IO.

Upvotes: 1

spender
spender

Reputation: 120380

EDIT

Given the time measurements provided after I wrote this answer, it appears that this approach was looking for savings in the wrong places. I'll leave my answer as a warning against optimization without measurement!!!


So, because of the nestedness of your approach, you are causing some unnecessary over-iteration of some of your collections leading to rather nasty Big O characteristics.

This can be mitigated by using the ILookup interface to pre-group collections by a key and to use these instead of repeated and expensive Where clauses.

I've had a stab at re-imagining your code to reduce complexity (but it is somewhat abstract):

var data2Lookup = data2.ToLookup(x => x.Type);
var tmp1 = 
    listOfKeys
        .Select(key => 
            new {
                key, 
                subData1 = data1[key], 
                subData2 = data2Lookup[key].GroupBy(x=>x.Category)
            })
        .Select(x => 
            new{
                x.key, 
                x.subData1, 
                x.subData2, 
                subData2Lookup = x.subData2.ToLookup(y => y.Key)});
var tmp2 = 
    tmp1
        .Select(x => 
            new{
                x.key, 
                grouped = x.subData1
                            .Select(sd1 => 
                                new{
                                    Data1 = sd1, 
                                    Data2 = subData2Lookup[sd1]
                                })
            });
var result =
    tmp2
        .ToDictionary(x => x.key, x => x.grouped);

It seems to me that the processing is somewhat arbitrarily place midway through the building of results, but shouldn't affect it, right?

So once results is built, let's process it...

var items = result.SelectMany(kvp => kvp.Value);
for(var item in items)
{
    item.Data1.Process(item.Data2);
}

EDIT

I've deliberately avoided going parallel fttb, so if you can get this working, there might be further speedup by adding a bit of parallel magic.

Upvotes: 1

Related Questions