Calin-Andrei Burloiu

Reputation: 1481

Can you know how many input values a reducer has in Hadoop without iterating over them?

I am writing a Reducer in Hadoop, and I am using its input values to build a byte array that encodes a list of elements. The size of the buffer into which I write my data depends on the number of values the reducer receives. It would be efficient to allocate that buffer at its final size in advance, but I don't know how many values there are without iterating over them with a "foreach" statement.

The Hadoop job's output is an HBase table.

UPDATE: After my data is processed by the mapper, the reducer keys follow a power-law distribution: only a few keys have many values (at most 9000), while most keys have just a few. I noticed that by allocating a buffer of 4096 bytes, 97.73% of the values fit into it. For the rest I can reallocate a buffer with double the capacity until all the values fit. In my test case this requires at most 6 reallocations, in the worst case of 9000 values for a single key.
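A minimal sketch of that doubling strategy in Java (the class and method names are mine; the 4096-byte starting capacity is the figure from the test above):

    import java.util.Arrays;

    // Growable byte buffer: starts at 4096 bytes and doubles its
    // capacity whenever the next write would not fit.
    public class GrowableBuffer {
        private byte[] buf = new byte[4096];
        private int used = 0;

        public void write(byte[] element) {
            while (used + element.length > buf.length) {
                buf = Arrays.copyOf(buf, buf.length * 2); // double and copy
            }
            System.arraycopy(element, 0, buf, used, element.length);
            used += element.length;
        }

        // Trim to the bytes actually written before handing the array off.
        public byte[] toByteArray() {
            return Arrays.copyOf(buf, used);
        }
    }

With the distribution above, 97.73% of keys never reallocate at all, and the worst key pays only 6 copies.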

Upvotes: 1

Views: 1266

Answers (2)

had00b

Reputation: 379

You can use the following paradigm:

Map: Each mapper keeps a map from keys to integers, where M[k] is the number of values it has sent out with a certain key k. At the end of its input, the mapper also sends out the key-value pairs (k, M[k]).

Sort: Use secondary sort so that the pairs (k, M[k]) come before the pairs (k, your values).

Reduce: Say we're looking at key k. Then the reducer first aggregates the counts M[k] coming from the different mappers to obtain a number n. This is the number you're looking for. Now you can create your data structure and do your computation.
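A hedged sketch of the Map step, assuming Text input lines whose key is the first tab-separated field; the "C"/"V" tags and all class names are illustrative, and the custom sort comparator that makes "C" records sort before "V" records for the same key is not shown:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Counts values per key while emitting them, then emits the
    // (k, M[k]) count records from cleanup() at the end of the input.
    public class CountingMapper extends Mapper<LongWritable, Text, Text, Text> {

        private final Map<String, Integer> counts = new HashMap<>();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t", 2);
            if (fields.length < 2) return;
            counts.merge(fields[0], 1, Integer::sum);
            ctx.write(new Text(fields[0]), new Text("V" + fields[1]));
        }

        @Override
        protected void cleanup(Context ctx)
                throws IOException, InterruptedException {
            // One (k, M[k]) pair per key this mapper has seen.
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                ctx.write(new Text(e.getKey()), new Text("C" + e.getValue()));
            }
        }
    }

On the reduce side, the leading "C" records for key k are summed to obtain n before the first "V" record is consumed.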

Upvotes: 0

Judge Mental

Reputation: 5239

I assume you're going to go through them with for-each anyway after you've allocated your byte array, but you don't want to buffer all the records in memory, since you can only loop through the iterator you get back from your value collection once. Therefore, you could:

  1. Run a counting reducer that outputs every input record and also outputs the count to a record of the same value class as the map output, and then run a "reduce-only" job on that result using a custom sort that puts the count first (recommended)
  2. Override the built-in sorting you get with Hadoop to count while sorting, and inject that count record as the first record of its output (it's not totally clear to me how you would accomplish the override, but anything's possible)
  3. If the values are unique, you might be able to have a stateful sort comparator that retains a hash of the values with which it gets called (this seems awfully hacky and error-prone, but I bet you could get it to work if the mechanics of secondary sort are confined to one class loader in one JVM)
  4. Design your reducer to use a more flexible data structure than a byte array, and convert the result to a byte array before outputting if necessary (highly recommended; see the sketch after this list)
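A minimal sketch of option 4, assuming Text values in and a BytesWritable out; the "encoding" here is just a straight copy of each value's bytes, standing in for whatever encoding the question actually uses:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Writes through a growable stream and converts to a byte array
    // only once, after the single pass over the values.
    public class EncodingReducer extends Reducer<Text, Text, Text, BytesWritable> {

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            ByteArrayOutputStream out = new ByteArrayOutputStream(4096);
            for (Text v : values) {
                // Append each value's bytes; the stream grows as needed.
                out.write(v.getBytes(), 0, v.getLength());
            }
            ctx.write(key, new BytesWritable(out.toByteArray()));
        }
    }

ByteArrayOutputStream already grows its internal array (roughly doubling) on overflow, much like the strategy in the question's update, so the reducer never needs the count up front.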

Upvotes: 2
