Reputation: 7571
I'm trying to use appengine-mapreduce to prepare data for loading into BigQuery and am running into a memory limit. I'm using Cloud Storage, so perhaps the accepted response in this question does not apply.
What I'm seeing is that a single VM instance, which appears to be the coordinator for the overall mapper task, exceeds the ~1 GB allocated to it and is then killed before any workers start. In this screenshot, there are three instances, and only the top one is growing in memory:
In several earlier attempts there were as many as twelve instances; all but one stayed well under the memory limit, and only one reached the limit and was killed. This suggests to me that there isn't a general memory leak (which might be addressed with Guido van Rossum's suggestion in the earlier question to periodically call gc.collect()), but instead a mismatch between the size and number of files being processed and the assumptions about file size and count in the appengine-mapreduce code.
In the above example, twelve .zip files are being handed to the job for processing. The smallest .zip file is about 12 MB compressed and the largest is 45 MB compressed. Following is the configuration I'm passing to the MapperPipeline:
output = yield mapreduce_pipeline.MapperPipeline(
    info.job_name,
    info.mapper_function,
    'mapreduce.input_readers.FileInputReader',
    output_writer_spec='mapreduce.output_writers.FileOutputWriter',
    params={
        'input_reader': {
            'files': gs_paths,
            'format': 'zip[lines]',
        },
        'output_writer': {
            'filesystem': 'gs',
            'gs_bucket_name': info.results_dirname,
            'output_sharding': 'none',
        },
    },
    shards=info.shard_count)
The value of shard_count is 16 in this case.
Is this a familiar situation? Is there something straightforward I can do to avoid hitting the 1GB memory limit? (It is possible that the poster of this question, which has gone unanswered, was running into a similar issue.)
Upvotes: 1
Views: 591
Reputation: 7571
I was able to get over the first hurdle, where the coordinator was being killed because it was running out of memory, by using files on the order of ~1-6 MB and upping the shard count to 96 shards. Here are some things I learned after that:
- If you call gc.collect() too often, there is a big degradation in throughput. If you call it too little, instances will be killed. (There's a sketch of what I mean right after this list.)
- Beyond that, memory use seems to come down to how the input files are divided up among shards by appengine-mapreduce.
- The memory-related tuning seems to be specific to the amount of data that one is uploading. I'm pessimistic that it's going to be straightforward to work out a general approach for all of the different tables that I need to process, each of which has a different aggregate size. The most I've been able to process so far is 140 MB compressed (perhaps 1-2 GB uncompressed).
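To make the gc.collect() point concrete, here is a minimal sketch of a mapper that throttles collection by record count. The interval of 500, the module-level counter, and the assumption that the mapper receives plain text lines are illustrative choices of mine, not anything prescribed by appengine-mapreduce:

import gc

# Illustrative interval: collect every N records. Too small and throughput
# drops; too large and the instance risks being killed for exceeding memory.
_RECORDS_PER_GC = 500
_records_seen = 0

def line_mapper(line):
    """Example mapper: emits each input line and collects garbage periodically."""
    global _records_seen
    _records_seen += 1
    if _records_seen % _RECORDS_PER_GC == 0:
        gc.collect()
    yield line + '\n'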
For anyone attempting this, here are some of the numbers I kept track of:
Chunk    Shards  Total MB  MB/shard  Time   Completed?  Total memory         Instances  Comments
daily    96      83.15     0.87      9:45   Yes         ?                    ?
daily    96      121       1.2       12:35  Yes         ?                    ?
daily    96      140       1.5       8:36   Yes         1200 MB (s. 400 MB)  24
daily    180     236       1.3       -      No          1200 MB              ?
monthly  32      12        0.38      5:46   Yes         Worker killed        4
monthly  110     140       1.3       -      No          Workers killed       4          Better memory management of workers needed.
monthly  110     140       1.3       -      No          -                    8          Memory was better, but throughput became unworkable.
monthly  32      140       4.4       -      No          Coordinator killed   -
monthly  64      140       2.18      -      No          Workers killed       -
daily    180     236       1.3       -      No          -                    -          HTTP timeouts during startup
daily    96      140       1.5       -      No          -                    -          HTTP timeouts during startup
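Reading the MB/shard column, the runs that completed were all at 1.5 MB of compressed input per shard or less, while everything at roughly 2 MB/shard or more lost workers or the coordinator; MB/shard alone isn't the whole story, since some runs at 1.3 MB/shard failed for other reasons. If that pattern holds for your data, a rough way to pick a shard count might look like the sketch below; the target of 1.5 MB per shard and the cap of 256 shards are guesses of mine, not anything built into appengine-mapreduce.

import math

def estimate_shard_count(total_compressed_mb, target_mb_per_shard=1.5, max_shards=256):
    """Pick a shard count that keeps each shard to roughly 1.5 MB or less of
    compressed input, per the table above. Both keyword defaults are made up."""
    shards = int(math.ceil(total_compressed_mb / float(target_mb_per_shard)))
    return max(1, min(shards, max_shards))

# For example, the 140 MB runs come out to 94 shards, close to the 96 that completed.
print(estimate_shard_count(140))  # 94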
Upvotes: 1