Eric Walker

Reputation: 7571

Memory limit for a single instance exceeded during start of appengine-mapreduce job

I'm trying to use appengine-mapreduce to prepare data for loading into BigQuery and am running into a memory limit. I'm reading from Cloud Storage, so perhaps the accepted answer to this question does not apply.

What I'm seeing is that a single VM instance, which appears to be the coordinator for the overall mapper task, is exceeding the ~1 GB allocated to it and then being killed before any workers start. In this screenshot, there are three instances, and only the top one is growing in memory:

[screenshot: memory usage of three instances, with only the top one climbing toward the limit]

In several earlier attempts there were as many as twelve instances; all but one stayed well under the memory limit, while a single instance climbed to the limit and was killed. This suggests to me that there isn't a general memory leak (the kind that might be addressed by Guido van Rossum's suggestion in the earlier question to call gc.collect() periodically), but rather a mismatch between the size and number of files being processed and the assumptions about file size and count built into the appengine-mapreduce code.

In the above example, twelve .zip files are being handed to the job for processing. The smallest .zip file is about 12 MB compressed, and the largest is 45 MB compressed. Following is the configuration I'm passing to the MapperPipeline:

output = yield mapreduce_pipeline.MapperPipeline(
    info.job_name,
    info.mapper_function,
    'mapreduce.input_readers.FileInputReader',
    output_writer_spec='mapreduce.output_writers.FileOutputWriter',
    params={
        'input_reader': {
            'files': gs_paths,
            'format': 'zip[lines]',
        },
        'output_writer': {
            'filesystem': 'gs',
            'gs_bucket_name': info.results_dirname,
            'output_sharding': 'none',
        },
    },
    shards=info.shard_count)

The value of shard_count is 16 in this case.
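
For reference, here is a minimal sketch of how a list like gs_paths might be assembled with the App Engine cloudstorage client. The '/gs/<bucket>/<object>' path form is what I understand the Files-API-based FileInputReader to expect, and the bucket and prefix names below are placeholders, not my actual configuration.

# Sketch only: build the gs_paths list handed to FileInputReader above.
# Assumes Files-API-style '/gs/<bucket>/<object>' paths and the
# GoogleAppEngineCloudStorageClient library; bucket/prefix are placeholders.
import cloudstorage as gcs

def list_zip_paths(bucket, prefix=''):
    """Return FileInputReader-style paths for every .zip object under prefix."""
    paths = []
    for stat in gcs.listbucket('/%s/%s' % (bucket, prefix)):
        if stat.filename.endswith('.zip'):
            # listbucket yields names like '/my-bucket/dir/file.zip';
            # the reader expects them prefixed with '/gs'.
            paths.append('/gs' + stat.filename)
    return paths

# e.g. gs_paths = list_zip_paths('my-export-bucket', 'exports/2014-05/')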

Is this a familiar situation? Is there something straightforward I can do to avoid hitting the 1GB memory limit? (It is possible that the poster of this question, which has gone unanswered, was running into a similar issue.)

Upvotes: 1

Views: 591

Answers (1)

Eric Walker

Reputation: 7571

I was able to get over the first hurdle, where the coordinator was being killed because it was running out of memory, by using files on the order of ~1-6 MB each and upping the shard count to 96 (a splitting sketch appears further down). Here are some things I learned after that:

  • The shard count is not roughly equivalent to the number of instances, although it may be equivalent to the number of workers. At a later point, when I had 200+ shards, only about 30 instances were spun up.
  • Memory management was not so relevant when the coordinator was being killed, but it became relevant later on, when secondary instances were running out of memory.
  • If you call gc.collect() too often, throughput degrades badly; if you call it too rarely, instances get killed (a throttling sketch follows this list).
  • As I had guessed, there appears to be a complex relationship between the number of files to be processed, the individual file sizes, the number of shards specified, how often garbage collection takes place, and the maximum memory available on an instance, all of which must be compatible to avoid running into memory limits.
  • The App Engine infrastructure appears to go through periods of high utilization, during which a configuration that was working before starts to fail due to HTTP timeouts in the parts of the stack handled by appengine-mapreduce.
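
Here is a rough sketch of the kind of throttled collection I mean, inside the mapper function itself. The interval is a guess you would tune against your own throughput numbers, and the exact shape of the record passed to the mapper depends on the reader format ('zip[lines]' in my case).

# Sketch only: run gc.collect() every N records instead of on every call.
# GC_EVERY is a placeholder to tune; the counter is per-instance state.
import gc

GC_EVERY = 500      # records between forced collections (tune this)
_seen = [0]         # module-level counter, survives across mapper calls

def mapper_function(data):
    """Map one record, forcing a collection every GC_EVERY records."""
    _seen[0] += 1
    if _seen[0] % GC_EVERY == 0:
        gc.collect()

    # ... the real per-record transformation goes here ...
    yield data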

The memory-related tuning seems to be specific to the amount of data being uploaded. I'm pessimistic that a general approach will fall out easily for all of the different tables I need to process, each of which has a different aggregate size. The most I've been able to process so far is 140 MB compressed (perhaps 1-2 GB uncompressed).
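
As a rough illustration of the pre-splitting step mentioned above, something along these lines (run before the files are uploaded) keeps each input zip in the 1-6 MB range. The target size and helper names are illustrative, not the exact code I used.

# Sketch only: rewrite one large .zip as several smaller ones so that no
# single input is much bigger than a few MB compressed.
import os
import zipfile

TARGET_BYTES = 5 * 1024 * 1024  # aim for ~5 MB compressed per part (a guess)

def split_zip(src_path, dest_dir):
    """Split src_path into several smaller zips of line-oriented data."""
    part, out_paths = 0, []
    with zipfile.ZipFile(src_path) as src:
        for name in src.namelist():
            buf, buf_size = [], 0
            for line in src.read(name).splitlines(True):
                buf.append(line)
                buf_size += len(line)
                # Crude heuristic: flush once the uncompressed buffer is a
                # few times the compressed target.
                if buf_size > TARGET_BYTES * 4:
                    out_paths.append(_write_part(dest_dir, name, part, buf))
                    part, buf, buf_size = part + 1, [], 0
            if buf:
                out_paths.append(_write_part(dest_dir, name, part, buf))
                part += 1
    return out_paths

def _write_part(dest_dir, member_name, part, lines):
    """Write one chunk of lines to its own compressed zip and return its path."""
    base = os.path.basename(member_name)
    out_path = os.path.join(dest_dir, '%s.part%03d.zip' % (base, part))
    with zipfile.ZipFile(out_path, 'w', zipfile.ZIP_DEFLATED) as out:
        out.writestr(base, b''.join(lines))
    return out_path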

For anyone attempting this, here are some of the numbers I kept track of:

Chunk    Shards  Total MB (compressed)  MB/shard  Time (m:s)  Completed?  Total memory         Instances  Comments
daily    96      83.15                  0.87      9:45        Yes         ?                    ?
daily    96      121                    1.2       12:35       Yes         ?                    ?
daily    96      140                    1.5       8:36        Yes         1200 MB (s. 400 MB)  24
daily    180     236                    1.3       -           No          1200 MB              ?
monthly  32      12                     0.38      5:46        Yes         Worker killed        4
monthly  110     140                    1.3       -           No          Workers killed       4          Better memory management of workers needed.
monthly  110     140                    1.3       -           No                               8          Memory was better, but throughput became unworkable.
monthly  32      140                    4.4       -           No          Coordinator killed   -
monthly  64      140                    2.18      -           No          Workers killed       -
daily    180     236                    1.3       -           No          -                    -          HTTP timeouts during startup
daily    96      140                    1.5       -           No          -                    -          HTTP timeouts during startup

Upvotes: 1
