Reputation: 1234
I have ~50M entities stored in Datastore. Each entity is one of 7 possible types.
Next, I have a simple MapReduce job that counts the number of entities of each type. It is written in Python and based on the appengine-mapreduce library. The mapper emits (type, 1), and the reducer simply sums the 1s received for each type.
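For reference, the job is shaped roughly like this (a simplified sketch using the standard mapreduce_pipeline.MapreducePipeline API; MyEntity and its type property stand in for my real model):

```python
from mapreduce import mapreduce_pipeline

def count_map(entity):
    # Emit (type, 1) for every datastore entity.
    yield (entity.type, "1")

def count_reduce(key, values):
    # Sum the 1s received for this type.
    yield "%s: %d\n" % (key, sum(int(v) for v in values))

job = mapreduce_pipeline.MapreducePipeline(
    "count_entity_types",
    mapper_spec="main.count_map",
    reducer_spec="main.count_reduce",
    input_reader_spec="mapreduce.input_readers.DatastoreInputReader",
    output_writer_spec="mapreduce.output_writers.BlobstoreOutputWriter",
    mapper_params={"entity_kind": "main.MyEntity"},
    reducer_params={"mime_type": "text/plain"},
    shards=5000)
job.start()
```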
When I run this job with 5000 shards, the map stage runs fine. It uses a total of 20 instances, which is the maximum allowed by my task-queue configuration.
However, the shuffle-hash stage uses only one instance and fails with an out-of-memory error. I cannot understand why only one instance is used for hashing, or how I can fix this out-of-memory error.
I have tried writing a combiner, but I never saw a combiner stage on the MapReduce status page or in the logs. The combiner I attempted looked roughly like the sketch below.
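This sketch assumes the combiner_spec parameter of MapreducePipeline and its (key, new_values, previously_combined_values) calling convention; the function name is a placeholder:

```python
def count_combine(key, new_values, previously_combined):
    # Fold each batch of mapper outputs into a running partial sum
    # so that less data reaches the shuffle stage.
    total = sum(int(v) for v in new_values)
    total += sum(int(v) for v in previously_combined)
    yield str(total)

# Wired into the pipeline alongside the other specs:
#   combiner_spec="main.count_combine",
```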
Also, the wiki for appengine-mapreduce on GitHub is obsolete, and I cannot find an active community where I can ask questions.
Upvotes: 1
Views: 165
Reputation: 1061
You are correct that the Python shuffle is in-memory and does not scale. There is a way to make the Python MR use the Java MR shuffle phase (which is fast and scales). Unfortunately, documentation about it (the setup, and how the two libraries communicate) is poor. See this issue for more information.
Upvotes: 2