Reputation: 5539
I want to know if the following can be done in Hadoop:
Suppose I have 3 machines running 3 map tasks and 3 reduce tasks, normally one map task and one reduce task per machine. I have a set of keys: A, B, C, D, E, F, G, H, I.
After the map phase, is it possible to force all values with keys A, B, C to always reside on machine 1, all values with keys D, E, F to always reside on machine 2, and so on?
Currently I use a partitioner based on hash(key).
This job will run more than once, and I never want values with keys G, H, I to end up on machine 1; they should only ever be on machine 3.
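For reference, Hadoop's built-in HashPartitioner is essentially the following, so which keys land together is an artifact of hashCode() rather than anything I control:

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Hadoop's default partitioner: assigns a key to a reduce partition
// by hashing, so the grouping of keys is effectively arbitrary.
public class HashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative,
        // then map the hash onto the configured number of reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```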
Upvotes: 2
Views: 370
Reputation: 30089
With a custom partitioner you can define that A, B and C will all be sent to the same reducer, but you are not able to control which node in your cluster will actually run that reduce task.
You should also note that even if you define that A, B, and C will all be sent to the same reducer, it is possible that D, E, and F will also be sent to that same reducer, for example if you configure only a single reducer.
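A minimal sketch of such a custom partitioner, assuming Text keys and values (adjust the types to your job), routing A-C to partition 0, D-F to partition 1, and G-I to partition 2:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch: groups keys A-C, D-F, and G-I onto fixed reduce partitions.
// KeyGroupPartitioner and the Text value type are illustrative choices.
public class KeyGroupPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        char c = key.toString().charAt(0);
        if (c >= 'A' && c <= 'C') return 0;
        // Modulo keeps the result valid even if fewer than 3 reducers
        // are configured -- in which case groups collapse together,
        // as noted above.
        if (c >= 'D' && c <= 'F') return 1 % numPartitions;
        return 2 % numPartitions;
    }
}
```

You would register it on the job with `job.setPartitionerClass(KeyGroupPartitioner.class)` and `job.setNumReduceTasks(3)`; this controls which reduce task (partition) each key group goes to, but, as said, not which physical machine runs that task.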
Upvotes: 1