Basic

Reputation: 26766

Handling an out of memory exception gracefully in .NET (or avoiding it entirely)

I've got a job processor which needs to handle ~300 jobs in parallel (jobs can take up to 5 minutes to complete, but they are usually network-bound).

The issue I've got is that jobs tend to come in clumps of a specific type. For simplicity, let's say there are six job types, JobA through JobF.

JobA through JobE are network-bound and can quite happily have 300 running together without taxing the system at all (in fact, I've managed to get more than 1,500 running side-by-side in tests). JobF (a new job type) is also network-bound, but it requires a considerable chunk of memory and actually uses GDI functionality.

I'm making sure I carefully dispose of all GDI objects with using blocks, and according to the profiler I'm not leaking anything. It's simply that running 300 JobF jobs in parallel uses more memory than .NET is willing to give me.

What's the best-practice way of dealing with this? My first thought was to determine how much memory headroom I had and throttle the spawning of new jobs (at least JobF jobs) as I approached the limit. I haven't been able to achieve this because I can't find any way to reliably determine what the framework is willing to allocate to me. I'd also have to guess at the maximum memory used by a job, which seems a little flaky.
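One mechanism worth knowing about here is System.Runtime.MemoryFailPoint, which asks the CLR up front whether an allocation of a given size is likely to succeed, and throws InsufficientMemoryException (rather than a mid-job OOM) if not. A minimal sketch, assuming a made-up 200 MB per-JobF estimate that you would replace with a measured figure:

```csharp
using System;
using System.Runtime;

class JobGate
{
    // Assumption for illustration only: a JobF needs roughly this much memory.
    const int EstimatedJobFMegabytes = 200;

    public static bool TryStartJobF(Action jobBody)
    {
        try
        {
            // Throws InsufficientMemoryException *before* the job starts if the
            // CLR believes it cannot satisfy an allocation of this size.
            using (new MemoryFailPoint(EstimatedJobFMegabytes))
            {
                jobBody();
                return true;
            }
        }
        catch (InsufficientMemoryException)
        {
            // Not enough headroom right now: signal the caller to reschedule
            // instead of risking an OutOfMemoryException mid-flight.
            return false;
        }
    }
}
```

You still have to estimate the per-job footprint, but the failure mode becomes a clean, catchable "don't start yet" rather than a process-threatening OOM.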

My next plan was simply to throttle if I got OOMs and re-schedule the failed jobs. Unfortunately, the OOM can occur anywhere, not just inside the problematic jobs. In fact, the most common place is the main worker thread that manages the jobs. As things stand, this causes the process to do a graceful shutdown (if possible), restart, and attempt to recover. While this works, it's nasty and wasteful of time and resources - far worse than just recycling that particular job.
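For the "recycle just that job" idea, the per-job wrapper could look something like the sketch below. All names are illustrative, and note the caveat: catching OutOfMemoryException is only safe when the OOM surfaces inside the job body; an OOM on the manager thread still has to be handled separately, which is exactly the problem described above.

```csharp
using System;
using System.Collections.Concurrent;

class JobRunner
{
    // Jobs that failed with OOM, waiting to be retried later under lower load.
    readonly ConcurrentQueue<Action> retryQueue = new ConcurrentQueue<Action>();

    public void Run(Action jobBody)
    {
        try
        {
            jobBody();
        }
        catch (OutOfMemoryException)
        {
            // Recycle only the failing job; the process keeps running.
            // A collection here gives the eventual retry a fighting chance.
            GC.Collect();
            retryQueue.Enqueue(jobBody);
        }
    }

    public int PendingRetries { get { return retryQueue.Count; } }
}
```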

Is there a standard way to handle this situation? (Adding more memory is an option and will be done, but the application should handle the situation gracefully, not just bomb out.)

Upvotes: 3

Views: 1899

Answers (4)

Jordão

Reputation: 56477

"It's simply that running 300 JobF in parallel uses more memory than .NET is willing to give me."

Well then, just don't do that. Queue your jobs up in the system ThreadPool. Or, alternatively, scale out and distribute the load to more systems.
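A minimal sketch of queueing work items instead of spawning everything at once (the job body is a placeholder): the pool, not your code, decides how many run concurrently.

```csharp
using System;
using System.Threading;

class PoolDemo
{
    // Queue 'count' jobs on the ThreadPool and block until they all finish.
    public static int RunJobs(int count)
    {
        int completed = 0;
        using (var done = new CountdownEvent(count))
        {
            for (int i = 0; i < count; i++)
            {
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    // Real job body (network call, GDI work, etc.) would go here.
                    Interlocked.Increment(ref completed);
                    done.Signal();
                });
            }
            done.Wait(); // wait for every queued job to signal completion
        }
        return completed;
    }
}
```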

Also, take a look at CERs (constrained execution regions) to at least run cleanup code if an out-of-memory exception happens.
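A hedged sketch of the CER idea: RuntimeHelpers.PrepareConstrainedRegions tells the CLR to eagerly prepare the finally block up front, so the cleanup code can still run under resource pressure (including some out-of-memory situations). The method names here are illustrative.

```csharp
using System;
using System.Runtime.CompilerServices;

class Cleanup
{
    public static void DoWorkWithGuaranteedCleanup(Action work, Action cleanup)
    {
        // Must immediately precede the try block; the CLR pre-prepares
        // the finally so it can execute even when resources are scarce.
        RuntimeHelpers.PrepareConstrainedRegions();
        try
        {
            work();
        }
        finally
        {
            cleanup(); // runs even if work() threw
        }
    }
}
```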

UPDATE: Since you mentioned you use GDI, another thing to be aware of is that it can throw an OutOfMemoryException for conditions that are not actually out of memory.
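The classic example of this GDI+ quirk is Image.FromFile, which throws OutOfMemoryException when the file is not a valid image, even though no memory is exhausted at all. A small sketch of telling that case apart:

```csharp
using System;
using System.Drawing;

class GdiQuirk
{
    // Returns the loaded image, or null if GDI+ rejected the file.
    public static Image TryLoad(string path)
    {
        try
        {
            return Image.FromFile(path);
        }
        catch (OutOfMemoryException)
        {
            // Almost certainly a corrupt or unsupported image file,
            // not real memory pressure.
            return null;
        }
    }
}
```

If your OOMs originate inside GDI calls like this one, they may be misdiagnoses rather than genuine exhaustion.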

Upvotes: 2

Daniel Mošmondor

Reputation: 19956

I am doing something remotely similar to your case, and I opted for an approach in which I have ONE task processor (a main queue manager that runs on ONE node) and as many AGENTS as needed running on one or more nodes.

Each of the agents run as a separate process. They:

  • check for task availability
  • download required data
  • process data
  • upload result

The queue manager is designed so that if any agent fails during execution of a job, the job is simply re-assigned to another agent after some time.

8 agents running side-by-side in one box

BTW, consider NOT having all the tasks run in parallel at once, since there really is some overhead (it might be substantial) in context switching. In your case, you might be saturating the network with unnecessary PROTOCOL traffic instead of real DATA traffic.

Another fine point of this design is that if I start to fall behind on data processing, I can always spin up one more machine (say, an Amazon EC2 instance) and run several more agents to help work through the task base more quickly.

In answer to your question:

Every host will take as much as it can, since there is a finite number of agents running on each host. When one task is DONE, another is taken, and so on ad infinitum. I don't use a database. Tasks aren't time-critical, so I have one process that goes round and round over the incoming data set and creates new tasks if something failed in previous run(s). Concretely:

http://access3.streamsink.com/archive/ (source data)

http://access3.streamsink.com/tbstrips/ (calculated results)

On each queue manager run, the source and destination are scanned, the resulting sets are subtracted, and the remaining filenames are turned into tasks.
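The scan-and-subtract step above can be sketched as follows; the folder names are placeholders, and this assumes result files keep the same name as their source file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class TaskScanner
{
    // Anything present in the source folder but missing from the results
    // folder becomes a new task on this manager run.
    public static List<string> FindPendingTasks(string sourceDir, string resultDir)
    {
        var source = Directory.GetFiles(sourceDir).Select(Path.GetFileName);
        var done = new HashSet<string>(
            Directory.GetFiles(resultDir).Select(Path.GetFileName));

        // Set subtraction: source minus results = work still to do.
        return source.Where(name => !done.Contains(name)).ToList();
    }
}
```

Because the diff is recomputed on every run, a task that failed (and so produced no result file) automatically reappears as pending, which is what makes the retry loop described above work without a database.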

And still some more:

I am using web services to get job info and return results, and plain HTTP to fetch the data for processing.

Finally:

This is the simpler of the two manager/agent pairs that I have - the other one is somewhat more complicated, so I won't go into detail on it here. Use e-mail :)

Upvotes: 2

paparazzo

Reputation: 45096

Ideally, you could partition jobs by process profile: CPU-bound, memory-bound, IO-bound, network-bound. I am a rookie at parallel processing, but what the TPL does well is CPU-bound work, and you can't really tune much beyond MaxDegreeOfParallelism.

A start: CPU-bound work gets MaxDegreeOfParallelism = System.Environment.ProcessorCount - 1

And everything else gets MaxDegreeOfParallelism = 100. I know you said the network stuff will scale bigger, but at some point the limit is your bandwidth. Is spinning up 300 jobs (that eat memory) really giving you more throughput? If so, look at the answer from Jordão.
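The two throttles above can be sketched with ParallelOptions; the job delegates are placeholders, and the guard against a zero degree of parallelism on single-core boxes is my addition.

```csharp
using System;
using System.Threading.Tasks;

class Throttles
{
    public static void Run(Action<int> cpuJob, Action<int> networkJob, int[] ids)
    {
        var cpuOptions = new ParallelOptions
        {
            // Leave one core free; never allow zero on a single-core machine.
            MaxDegreeOfParallelism = Math.Max(1, Environment.ProcessorCount - 1)
        };
        var netOptions = new ParallelOptions
        {
            // Network-bound work tolerates far more in-flight jobs.
            MaxDegreeOfParallelism = 100
        };

        Parallel.ForEach(ids, cpuOptions, id => cpuJob(id));
        Parallel.ForEach(ids, netOptions, id => networkJob(id));
    }
}
```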

Upvotes: 1

Jordi

Reputation: 2797

If your objects implement the IDisposable interface, you should not rely solely on the garbage collector to clean them up, because that can keep resources alive far longer than necessary.

For example, say you have this class:

class Mamerto : IDisposable
{
    public void methodA()
    {
        // do something
    }

    public void methodB()
    {
        // do something
    }

    public void Dispose()
    {
        // release resources
    }
}

And you use that class this way:

using( var m = new Mamerto() )
{
    m.methodA();
    m.methodB();
}   // Dispose runs automatically here, even if an exception was thrown

One correction to the usual worry: it is finalizers, not Dispose, that slow collection down. When an object has a finalizer, the garbage collector cannot reclaim it on the first Gen 0 pass; the object goes onto the finalization queue and is promoted to Gen 1, and since Gen 1 is collected far less often than Gen 0, its resources stay alive longer than necessary. Calling Dispose deterministically (which the using block does for you) and calling GC.SuppressFinalize(this) inside Dispose avoids that delay.

Please read this article for further info: http://msdn.microsoft.com/en-us/magazine/bb985011.aspx

You can also call Dispose explicitly, although inside a using block the extra call is redundant (a correctly written Dispose is safe to call more than once):

using( var m = new Mamerto() )
{
    m.methodA();
    m.methodB();
    m.Dispose();    // redundant: the using block will call Dispose again on exit
}

Upvotes: 0
