Reputation: 1481
I am writing a Reducer in Hadoop, and I use its input values to build a byte array that encodes a list of elements. The size of the buffer I write my data into depends on the number of values the reducer receives. It would be efficient to allocate it in memory in advance, but I don't know how many values there are without iterating over them with a "foreach" statement.
Hadoop output is an HBase table.
UPDATE: After my data is processed by the mapper, the reducer keys follow a power-law distribution: only a few keys have many values (at most 9000), while most keys have just a few. I noticed that with a buffer of 4096 bytes, 97.73% of the values fit. For the rest, I can reallocate a buffer with double the capacity until all values fit. In my test case this takes at most 6 reallocations, in the worst case of 9000 values for a single key.
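For reference, here is a minimal sketch of that doubling strategy. The element encoding is a placeholder; only the reallocation logic matters:

```java
import java.util.Arrays;

public class GrowingBuffer {

    private byte[] buffer = new byte[4096]; // fits 97.73% of keys in one shot
    private int used = 0;

    // Append one encoded element, doubling the capacity whenever it no longer
    // fits. For the 9000-value worst case this doubles at most 6 times.
    public void append(byte[] encoded) {
        while (used + encoded.length > buffer.length) {
            buffer = Arrays.copyOf(buffer, buffer.length * 2);
        }
        System.arraycopy(encoded, 0, buffer, used, encoded.length);
        used += encoded.length;
    }

    // Return exactly the bytes written so far.
    public byte[] toByteArray() {
        return Arrays.copyOf(buffer, used);
    }
}
```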
Upvotes: 1
Views: 1266
Reputation: 379
You can use the following paradigm (a sketch of the map step follows these three steps):
Map: Each mapper keeps a map from keys to integers, where M[k] is the number of values emitted with key k. At the end of its input, the mapper also emits the key-value pairs (k, M[k]).
Sort: Use secondary sort so that the pairs (k, M[k]) come before the pairs (k, your values).
Reduce: Say we're looking at key k. Then the reducer first aggregates the counts M[k] coming from the different mappers to obtain a number n. This is the number you're looking for. Now you can create your data structure and do your computation.
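A minimal sketch of the map step, assuming line-based key<TAB>value input and a composite string key "k#flag" to drive the secondary sort. A custom partitioner and grouping comparator that look only at the part before "#" are also required so that all records for k reach one reduce() call with the count first; they are omitted here:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<LongWritable, Text, Text, Text> {

    // M[k]: how many values this mapper has emitted for key k
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Hypothetical input format: key<TAB>value
        String[] parts = line.toString().split("\t", 2);
        counts.merge(parts[0], 1, Integer::sum);
        // Suffix "#1" makes ordinary values sort after the "#0" count records
        context.write(new Text(parts[0] + "#1"), new Text(parts[1]));
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Emit (k, M[k]) once per key at the end of input; "#0" sorts first
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            context.write(new Text(e.getKey() + "#0"),
                          new Text(e.getValue().toString()));
        }
    }
}
```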
Upvotes: 0
Reputation: 5239
I assume you're going to go through them with a for-each anyway after you've allocated your byte array, but you don't want to buffer all the records in memory (since you can only loop through the iterator you get from your value collection once). Therefore, you could write into a buffer that grows as you iterate, and extract the final byte array once the iteration completes.
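For example, a sketch using a ByteArrayOutputStream, which grows on its own while you consume the iterator a single time. The value type (BytesWritable) and the column family/qualifier are placeholders, and put.addColumn assumes HBase 1.x or later:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;

public class EncodingReducer
        extends TableReducer<Text, BytesWritable, ImmutableBytesWritable> {

    @Override
    protected void reduce(Text key, Iterable<BytesWritable> values, Context context)
            throws IOException, InterruptedException {
        // The value iterator can only be consumed once, so encode while
        // iterating; the stream grows itself, no pre-counting needed.
        ByteArrayOutputStream out = new ByteArrayOutputStream(4096);
        for (BytesWritable v : values) {
            out.write(v.getBytes(), 0, v.getLength());
        }
        byte[] encoded = out.toByteArray();

        // Placeholder column family "cf" and qualifier "data"
        Put put = new Put(Bytes.toBytes(key.toString()));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("data"), encoded);
        context.write(new ImmutableBytesWritable(put.getRow()), put);
    }
}
```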
Upvotes: 2