Reputation: 6186
I am running a Hadoop streaming job. The mapper emits (key, value) pairs, where each key is a sequence of words separated by whitespace.
I'd like to use a partitioner that returns the hash of the first two words of the key.
So I implemented it as follows:
public static class CounterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Split the key into words and use the first two (or the only one) as the prefix.
        String[] line = key.toString().split(" ");
        String prefix = (line.length > 1) ? (line[0] + line[1]) : line[0];
        // Mask off the sign bit so the partition index is never negative.
        return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
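To make the intended behaviour concrete, here is a minimal standalone sketch of the same hashing logic with no Hadoop dependency. The class name PrefixPartitionDemo, the method partitionFor, and the sample keys are hypothetical and only for illustration; the arithmetic is copied from the partitioner above.

```java
// Hypothetical demo class, not part of the actual job.
final class PrefixPartitionDemo {

    // Same logic as getPartition above, applied to a plain String key.
    static int partitionFor(String key, int numPartitions) {
        String[] words = key.split(" ");
        String prefix = (words.length > 1) ? (words[0] + words[1]) : words[0];
        // Clear the sign bit before taking the modulus so the result is in [0, numPartitions).
        return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // Sample keys are made up; the first two share the prefix "new york".
        String[] keys = {"new york city", "new york times", "san jose"};
        for (String key : keys) {
            System.out.println(key + " -> " + partitionFor(key, 4));
        }
    }
}
```

Keys that share their first two words always land in the same partition, which is the property I want from a built-in partitioner as well.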
My question: is there a way to do this with Hadoop's built-in library classes, just by modifying configuration properties such as the following?
mapred.output.key.comparator.class
stream.map.output.field.separator
stream.num.map.output.key.fields
map.output.key.field.separator
mapred.text.key.comparator.options
...
Thanks in advance.
Upvotes: 1
Views: 1460
Reputation: 33555
Hadoop's built-in library is written in Java, whereas the purpose of streaming is to let you write jobs in languages other than Java that communicate through STDIN/STDOUT.
I don't see the point of changing the streaming-related properties through the Hadoop API, which is itself Java.
BTW, besides setting configuration properties in the configuration files or on the command line, you can also set them with Configuration#set.
Upvotes: 2