Reputation: 5234
I'm processing a text file that contains employees and their dependents in a Parallel.ForEach. Order can't be guaranteed in the file. In each iteration I'm creating an employee or dependent object. Employee objects are added to a ConcurrentDictionary. Dependents are a property on the employee and need to be associated with the employee. The problem is that if an employee is not yet in the dictionary when I try to add the dependent, the dependent never gets added. I can cache these "orphaned" dependents and add them when the ForEach completes, but I think there's a better way.
Is there some way I can wait/spin/join in my Parallel.ForEach until the employee is added, then add the dependent? I'm not married to this solution, so alternatives are welcome.
Here's my code (edited for brevity):
var cx = System.IO.File.ReadLines(path);
var _employeeDictionary = new ConcurrentDictionary<string, Employee>();

Parallel.ForEach(cx, _options, line =>
{
    Employee employee = null;
    switch (line.Substring(0, 2))
    {
        case EmployeeLine:
            // Employee is created and added to dictionary....
            _employeeDictionary.GetOrAdd(winID, employee);
            break;

        case DependentLine:
            // Dependent is created
            // WILL NOT BE ADDED IF THE EMPLOYEE HASN'T BEEN ADDED YET
            if (_employeeDictionary.TryGetValue(dependent.EmpID, out employee))
            {
                lock (locker)
                {
                    employee.AddDependent(ci, dependent);
                }
            }
            break;
    }
});
Upvotes: 2
Views: 274
Reputation: 156624
The key to making concurrency work for you instead of against you is to make your code work more functionally. Instead of thinking in imperative terms ("do this for each line"), think in terms of how you can transform the data. Do your data transformations in parallel: if you're not modifying state, you're inherently thread-safe. Then use imperative programming for only the pieces that really need to be imperative.
var cx = System.IO.File.ReadAllLines(path);

var lineInfo = cx.AsParallel()
    .Select(line => new {
        lineCode = line.Substring(0, 2),
        line
    })
    .ToList()
    .AsParallel();

var employeeDictionary = lineInfo
    .Where(e => e.lineCode == EmployeeLine)
    .Select(e => ParseEmployee(e.line))
    .ToDictionary(e => e.winId);

var dependentLookup = lineInfo
    .Where(e => e.lineCode == DependentLine)
    .Select(e => ParseDependent(e.line))
    .ToLookup(d => d.EmpId);

Parallel.ForEach(employeeDictionary.Values, _options, employee =>
{
    foreach (var dependent in dependentLookup[employee.winId])
    {
        // It's even better if you can have an "AddDependents" method
        // to avoid the foreach, and leverage the efficiencies of
        // "AddRange"-type methods (a sketch follows below).
        employee.AddDependent(dependent);
    }
});
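The comment in that code mentions a batched alternative; here's a minimal sketch of it, assuming a hypothetical Employee.AddDependents(IEnumerable<Dependent>) method that internally does something like List<T>.AddRange:

// Sketch only: AddDependents is a hypothetical batch method on Employee,
// not part of the original code; it would append the whole group in one call.
Parallel.ForEach(employeeDictionary.Values, _options, employee =>
{
    employee.AddDependents(dependentLookup[employee.winId]);
});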
It's also worth noting that it may not actually be worthwhile to do parallel processing in this code. I'd suggest benchmarking it with and without parallelism, and if you don't see a noticeable improvement, don't bother.
Upvotes: 2
Reputation: 10708
Assuming your number of lines exceeds the number of processors on the machine you're using, you won't lose a noticeable amount of performance by processing the parent set first and then the child set. This even avoids the overhead of checking for missing parents, particularly since you have to lock for such operations.
Another option is to implement an UnknownEmployee class, which derives from Employee and represents an entry whose dependents are known but whose employee is not. Any time a dependent is created and its parent is unknown, create an UnknownEmployee entry and add the dependent to it. Any time an Employee is created, check whether the key already exists, and if it does, make sure to copy over the Dependents before adding it to the dictionary.
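Here's a rough sketch of that placeholder idea, reusing the question's variable names (their declarations were elided) and hypothetical UnknownEmployee and CopyDetailsFrom members. To keep the sketch race-free it merges the real employee's data into whatever object already sits in the dictionary rather than swapping the entry out, so dependents always land on a single object per ID:

// Placeholder sketch; UnknownEmployee and CopyDetailsFrom are hypothetical.
case DependentLine:
    // Whatever lives under EmpID (real employee or placeholder) collects dependents.
    var parent = _employeeDictionary.GetOrAdd(
        dependent.EmpID, id => new UnknownEmployee(id));
    lock (parent)
    {
        parent.AddDependent(ci, dependent);
    }
    break;

case EmployeeLine:
    // Keep whichever object is already stored so no dependents are lost;
    // if it was a placeholder, fill in the real employee's details.
    var stored = _employeeDictionary.GetOrAdd(winID, employee);
    if (!ReferenceEquals(stored, employee))
    {
        lock (stored)
        {
            stored.CopyDetailsFrom(employee);
        }
    }
    break;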
Upvotes: 0
Reputation: 273574
"Is there some way I can wait/spin/join in my Parallel.ForEach?"
Yes, but they will all sink your performance and introduce the possibility of deadlock. So that isn't the way to go.
Of course, that last if (TryGetValue(...)) needs an else branch or you'll lose data. Store the Dependents in a list for processing after the first Parallel.ForEach().
And then you might as well store all Dependents there, not bothering to look them up in the first run. Simpler and possibly even faster.
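A minimal sketch of that two-pass shape, reusing the question's names and assuming hypothetical ParseEmployee/ParseDependent helpers plus WinID/EmpID properties for the parsing code the question elided:

// Two-pass sketch: ParseEmployee, ParseDependent and HandleOrphan are
// hypothetical stand-ins for code the question left out.
var employees = new ConcurrentDictionary<string, Employee>();
var dependents = new ConcurrentBag<Dependent>();

// Pass 1: build all employees; just collect dependents without looking them up.
Parallel.ForEach(System.IO.File.ReadLines(path), _options, line =>
{
    switch (line.Substring(0, 2))
    {
        case EmployeeLine:
            var employee = ParseEmployee(line);
            employees.TryAdd(employee.WinID, employee);
            break;
        case DependentLine:
            dependents.Add(ParseDependent(line));
            break;
    }
});

// Pass 2: every employee is in the dictionary now, so a miss means a genuinely
// orphaned record rather than an ordering problem.
Parallel.ForEach(dependents, _options, dependent =>
{
    if (employees.TryGetValue(dependent.EmpID, out var employee))
    {
        lock (employee)               // AddDependent itself isn't thread-safe
        {
            employee.AddDependent(dependent);
        }
    }
    else
    {
        HandleOrphan(dependent);      // log or repair the truly orphaned record
    }
});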
Upvotes: 4
Reputation: 525
You spelled out your own answer. They are orphaned because the employee simply does not exist yet, so you HAVE to wait until the end to ensure the employee is there. Locking will reduce performance. Just add the orphans to a concurrent list and run a Parallel.ForEach over that list when the initial file read is done.
Upvotes: 1