Reputation: 1234
I have ~50M entities stored in Datastore. Each entity is one of 7 possible types.
Next, I have a simple MapReduce job that counts the number of entities of each type. It is written in Python and based on the appengine-mapreduce library. The mapper emits (type, 1), and the reducer simply sums the 1s received for each type.
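For reference, the job is shaped roughly like this (a simplified sketch using the standard mapreduce_pipeline.MapreducePipeline API; MyEntity and its type property stand in for my real model):

```python
from mapreduce import mapreduce_pipeline

def count_map(entity):
    # Emit (type, 1) for every datastore entity.
    yield (entity.type, "1")

def count_reduce(key, values):
    # Sum the 1s received for this type.
    yield "%s: %d\n" % (key, sum(int(v) for v in values))

job = mapreduce_pipeline.MapreducePipeline(
    "count_entity_types",
    mapper_spec="main.count_map",
    reducer_spec="main.count_reduce",
    input_reader_spec="mapreduce.input_readers.DatastoreInputReader",
    output_writer_spec="mapreduce.output_writers.BlobstoreOutputWriter",
    mapper_params={"entity_kind": "main.MyEntity"},
    reducer_params={"mime_type": "text/plain"},
    shards=5000)
job.start()
```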
When I run this job with 5000 shards, the map stage runs fine. It uses a total of 20 instances, which is the maximum allowed by my task-queue configuration.
However, the shuffle-hash stage uses only one instance and fails with an out-of-memory error. I cannot understand why only one instance is used for hashing, or how I can fix this out-of-memory error.
I have tried writing a combiner, but I never saw a combiner stage on the MapReduce status page or in the logs. The combiner I attempted looked roughly like the sketch below.
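This sketch assumes the combiner_spec parameter of MapreducePipeline and its (key, new_values, previously_combined_values) calling convention; the function name is a placeholder:

```python
def count_combine(key, new_values, previously_combined):
    # Fold each batch of mapper outputs into a running partial sum
    # so that less data reaches the shuffle stage.
    total = sum(int(v) for v in new_values)
    total += sum(int(v) for v in previously_combined)
    yield str(total)

# Wired into the pipeline alongside the other specs:
#   combiner_spec="main.count_combine",
```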
Also, the wiki for appengine-mapreduce on GitHub is obsolete, and I cannot find an active community where I can ask questions.
Upvotes: 1
Views: 165
Reputation: 1061
You are correct that the Python shuffle is in-memory and does not scale. There is a way to make the Python MR use the Java MR shuffle phase (which is fast and scales). Unfortunately, documentation about it (the setup, and how the two libraries communicate) is poor. See this issue for more information.
Upvotes: 2