Calin-Andrei Burloiu

Reputation: 1481

Can you know how many input values a reducer has in Hadoop without iterating over them?

I am writing a Reducer in Hadoop, and I am using its input values to build a byte array that encodes a list of elements. The size of the buffer into which I write my data depends on the number of values the reducer receives. It would be efficient to allocate that buffer at its final size in advance, but I don't know how many values there are without iterating over them with a "foreach" statement.

The Hadoop job's output is an HBase table.

UPDATE: After my data is processed by the mapper, the reducer keys follow a power-law distribution: only a few keys have many values (at most 9000), while most keys have just a few. I noticed that by allocating a buffer of 4096 bytes, 97.73% of the values fit into it. For the rest I can reallocate a buffer with double the capacity until all the values fit. In my test case this requires at most 6 reallocations, in the worst case of 9000 values for a single key.
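A minimal sketch of that doubling strategy in Java (the class and method names are mine; the 4096-byte starting capacity is the figure from the test above):

    import java.util.Arrays;

    // Growable byte buffer: starts at 4096 bytes and doubles its
    // capacity whenever the next write would not fit.
    public class GrowableBuffer {
        private byte[] buf = new byte[4096];
        private int used = 0;

        public void write(byte[] element) {
            while (used + element.length > buf.length) {
                buf = Arrays.copyOf(buf, buf.length * 2); // double and copy
            }
            System.arraycopy(element, 0, buf, used, element.length);
            used += element.length;
        }

        // Trim to the bytes actually written before handing the array off.
        public byte[] toByteArray() {
            return Arrays.copyOf(buf, used);
        }
    }

With the distribution above, 97.73% of keys never reallocate at all, and the worst key pays only 6 copies.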

Upvotes: 1

Views: 1266

Answers (2)

had00b

Reputation: 379

You can use the following paradigm:

Map: Each mapper keeps a map from keys to integers, where M[k] is the number of values it has sent out with a certain key k. At the end of its input, the mapper also sends out the key-value pairs (k, M[k]).

Sort: Use secondary sort so that the pairs (k, M[k]) come before the pairs (k, your values).

Reduce: Say we're looking at key k. Then the reducer first aggregates the counts M[k] coming from the different mappers to obtain a number n. This is the number you're looking for. Now you can create your data structure and do your computation.
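A hedged sketch of the Map step, assuming Text input lines whose key is the first tab-separated field; the "C"/"V" tags and all class names are illustrative, and the custom sort comparator that makes "C" records sort before "V" records for the same key is not shown:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Counts values per key while emitting them, then emits the
    // (k, M[k]) count records from cleanup() at the end of the input.
    public class CountingMapper extends Mapper<LongWritable, Text, Text, Text> {

        private final Map<String, Integer> counts = new HashMap<>();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t", 2);
            if (fields.length < 2) return;
            counts.merge(fields[0], 1, Integer::sum);
            ctx.write(new Text(fields[0]), new Text("V" + fields[1]));
        }

        @Override
        protected void cleanup(Context ctx)
                throws IOException, InterruptedException {
            // One (k, M[k]) pair per key this mapper has seen.
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                ctx.write(new Text(e.getKey()), new Text("C" + e.getValue()));
            }
        }
    }

On the reduce side, the leading "C" records for key k are summed to obtain n before the first "V" record is consumed.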

Upvotes: 0

Judge Mental

Reputation: 5239

I assume you're going to go through them with for-each anyway after you've allocated your byte array, but you don't want to buffer all the records in memory, since you can only loop through the iterator you get back from your value collection once. Therefore, you could:

  1. Run a counting reducer that outputs every input record and also outputs the count to a record of the same value class as the map output, and then run a "reduce-only" job on that result using a custom sort that puts the count first (recommended)
  2. Override the built-in sorting you get with Hadoop to count while sorting, and inject that count record as the first record of its output (it's not totally clear to me how you would accomplish the override, but anything's possible)
  3. If the values are unique, you might be able to have a stateful sort comparator that retains a hash of the values with which it gets called (this seems awfully hacky and error-prone, but I bet you could get it to work if the mechanics of secondary sort are confined to one class loader in one JVM)
  4. Design your reducer to use a more flexible data structure than a byte array, and convert the result to a byte array before outputting if necessary (highly recommended; see the sketch after this list)
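A minimal sketch of option 4, assuming Text values in and a BytesWritable out; the "encoding" here is just a straight copy of each value's bytes, standing in for whatever encoding the question actually uses:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Writes through a growable stream and converts to a byte array
    // only once, after the single pass over the values.
    public class EncodingReducer extends Reducer<Text, Text, Text, BytesWritable> {

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            ByteArrayOutputStream out = new ByteArrayOutputStream(4096);
            for (Text v : values) {
                // Append each value's bytes; the stream grows as needed.
                out.write(v.getBytes(), 0, v.getLength());
            }
            ctx.write(key, new BytesWritable(out.toByteArray()));
        }
    }

ByteArrayOutputStream already grows its internal array (roughly doubling) on overflow, much like the strategy in the question's update, so the reducer never needs the count up front.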

Upvotes: 2
