Reputation: 26766
I've got a job processor which needs to handle ~300 jobs in parallel (jobs can take up to 5 minutes to complete, but they are usually network-bound).
The issue I've got is that jobs tend to come in clumps of a specific type. For simplicity, let's say there are six job types, JobA through JobF.
JobA - JobE are network-bound and can quite happily have 300 running together without taxing the system at all (actually, I've managed to get more than 1,500 running side-by-side in tests). JobF (a new job type) is also network-bound, but it requires a considerable chunk of memory and actually uses GDI functionality.
I'm making sure I carefully dispose of all GDI objects with using blocks, and according to the profiler, I'm not leaking anything. It's simply that running 300 JobF in parallel uses more memory than .NET is willing to give me.
What's the best practice way of dealing with this? My first thought was to determine how much memory overhead I had and throttle spawning new jobs as I approach the limit (at least for JobF jobs). I haven't been able to achieve this, as I can't find any way to reliably determine what the framework is willing to allocate me in terms of memory. I'd also have to guess at the maximum memory used by a job, which seems a little flaky.
My next plan was to simply throttle if I get OOMs and re-schedule the failed jobs. Unfortunately, the OOM can occur anywhere, not just inside the problematic jobs. In fact, the most common place is the main worker thread which manages the jobs. As things stand, this causes the process to do a graceful shutdown (if possible), restart and attempt to recover. While this works, it's nasty and wasteful of time and resources - far worse than just recycling that particular job.
Is there a standard way to handle this situation (adding more memory is an option and will be done, but the application should handle this situation properly, not just bomb out)?
Upvotes: 3
Views: 1899
Reputation: 56477
It's simply that running 300 JobF in parallel uses more memory than .NET is willing to give me.
Well then, just don't do this. Queue up your jobs in the system ThreadPool. Or, alternatively, scale out and distribute the load to more systems.
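For instance, a minimal sketch of queueing the work instead of starting every job yourself (ProcessJob and the Job type are placeholders):

// using System.Threading;
foreach (var job in jobs)
{
    // The ThreadPool decides how many work items actually run at once.
    ThreadPool.QueueUserWorkItem(state => ProcessJob((Job)state), job);
}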
Also, take a look at CERs to at least run cleanup code if an out-of-memory exception happens.
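For illustration, a rough sketch of the RuntimeHelpers.PrepareConstrainedRegions pattern (RunJob and ReleaseGdiResources are placeholder names):

// RuntimeHelpers lives in System.Runtime.CompilerServices. Preparing the
// region up front makes the CLR eagerly prepare the finally block, so the
// cleanup has a much better chance of running even under memory pressure.
RuntimeHelpers.PrepareConstrainedRegions();
try
{
    RunJob(job);
}
finally
{
    ReleaseGdiResources(job);
}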
UPDATE: Another thing to be aware of, since you mentioned you use GDI, is that it can throw an OutOfMemoryException for things that are not actually out-of-memory conditions.
Upvotes: 2
Reputation: 19956
I am doing something remotely similar to your case, and I opted for the approach in which I have ONE task processor (a main queue manager that runs on ONE node) and as many AGENTS as needed that run on one or more nodes.
Each of the agents runs as a separate process. They:
The queue manager is designed so that if any agent fails during execution of a job, the job is simply re-tasked to another agent after some time.
BTW, consider NOT having all the tasks run at once in parallel, since there really is some overhead (it might be substantial) in context switching. In your case, you might be saturating the network with unnecessary PROTOCOL traffic instead of real DATA traffic.
Another fine point of this design is that if I start to fall behind on data processing, I can always turn on one more machine (say, an Amazon EC2 instance) and run several more agents that will help complete the task base more quickly.
In answer to your question:
Every host will take as much as it can, since there is a finite number of agents running on one host. When one task is DONE, another is taken, and so on ad infinitum. I don't use a database. Tasks aren't time-critical, so I have one process that goes round and round over the incoming data set and creates new tasks if something failed in previous run(s). Concretely:
http://access3.streamsink.com/archive/ (source data)
http://access3.streamsink.com/tbstrips/ (calculated results)
On each queue manager run, the source and destination are scanned, the resulting sets are subtracted, and the remaining filenames are turned into tasks.
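A minimal sketch of that scan-and-subtract pass (ListFileNames and EnqueueTask are hypothetical helpers):

// using System.Linq; for Except()
var source = ListFileNames("http://access3.streamsink.com/archive/");
var done   = ListFileNames("http://access3.streamsink.com/tbstrips/");

// Anything in the source listing that has no calculated result yet
// becomes a new task.
foreach (var name in source.Except(done))
{
    EnqueueTask(name);
}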
And still some more:
I am using web services to get job info and return results, and simple HTTP to get the data for processing.
Finally:
This is the simpler of the two manager/agent pairs that I have - the other one is somewhat more complicated, so I won't go into detail about it here. Use the e-mail :)
Upvotes: 2
Reputation: 45096
Ideally you could partition jobs by processing profile: CPU bound, memory bound, IO bound, network bound. I am a rookie at parallel processing, but what the TPL does well is CPU-bound work, and you cannot really tune much beyond MaxDegreeOfParallelism.
A start is: CPU-bound work gets MaxDegreeOfParallelism = System.Environment.ProcessorCount - 1, and everything else gets MaxDegreeOfParallelism = 100. I know you said the network stuff will scale bigger, but at some point the limit is your bandwidth. Is spinning up 300 jobs (that eat memory) really giving you more throughput? If so, look at the answer from Joradao.
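For illustration, a rough sketch of those two settings with Parallel.ForEach (cpuJobs and networkJobs are hypothetical collections):

// using System.Threading.Tasks;
var cpuOptions = new ParallelOptions
{
    // Leave one core free for the rest of the process.
    MaxDegreeOfParallelism = System.Environment.ProcessorCount - 1
};
Parallel.ForEach(cpuJobs, cpuOptions, job => job.Run());

// Network-bound work can go much wider than the core count.
var netOptions = new ParallelOptions { MaxDegreeOfParallelism = 100 };
Parallel.ForEach(networkJobs, netOptions, job => job.Run());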
Upvotes: 1
Reputation: 2797
If your objects implement the IDisposable interface, you should not rely on the garbage collector alone, because that could produce a memory leak.
For example, if you have that class:
class Mamerto : IDisposable
{
    public void methodA()
    {
        // do something
    }

    public void methodB()
    {
        // do something
    }

    public void Dispose()
    {
        // release resources
    }
}
And you use that class like this:
using( var m = new Mamerto() )
{
    m.methodA();
    m.methodB();
    // you should call dispose here!
}
The garbage collector will mark the m object as "ready to delete", putting it in the Gen 0 collection. When the garbage collector tries to delete all the objects in Gen 0, it detects the Dispose method and automatically promotes the object to Gen 1 (because it's not "so easy" to delete that object). Gen 1 objects are not checked as often as Gen 0 objects, which could then lead to a memory leak.
Please read this article for further info: http://msdn.microsoft.com/en-us/magazine/bb985011.aspx
If you proceed with an explicit Dispose, then you can avoid that annoying leak.
using( var m = new Mamerto() )
{
    m.methodA();
    m.methodB();
    m.Dispose();
}
Upvotes: 0