Reputation: 25
I have a simple class called Pair that implements org.apache.hadoop.io.Writable. It contains two fields and is used as a value in the MapReduce process.
For each key, I want to find the pair with the largest value of one of Pair's fields (preco). In the reducer, the following code produces the expected result:
float max = 0;
String country = "";
for (Pair p : values) {
    if (p.getPreco().get() > max) {
        max = p.getPreco().get();
        country = p.getPais().toString();
    }
}
context.write(key, new Pair(new FloatWritable(max), new Text(country)));
The following code, on the other hand, does not:
Pair max = new Pair();
for (Pair p : values) {
    if (p.getPreco().get() > max.getPreco().get()) {
        max = p;
    }
}
context.write(key, max);
The second version produces, for each key, the last value associated with it in the input file rather than the highest one.
Is there a reason for this apparently strange behavior?
Upvotes: 1
Views: 51
Reputation: 13927
You have this problem because the reducer reuses objects: the iterator over the values hands you the same object every time, refilled with new data. So this code:
max = p;
will always leave max referring to the current contents of p. You need to copy the data into max rather than keep a reference to the object. That is why the first version of your code works.
In Hadoop I would usually implement a .set() method on a custom Writable; it is a common pattern you will see. Your Pair class might look a bit like this (it's missing the interface methods etc.):
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class Pair implements Writable {
    public FloatWritable max = new FloatWritable();
    public Text country = new Text();

    // Copy the field values out of p rather than keeping a reference to p itself.
    public void set(Pair p) {
        this.max.set(p.max.get());
        this.country.set(p.country);
    }
}
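For completeness, the omitted interface methods would look roughly like this (a sketch; the only real requirement is that write and readFields handle the fields in the same order, and you would also import java.io.DataInput, java.io.DataOutput and java.io.IOException):
@Override
public void write(DataOutput out) throws IOException {
    // Serialize the fields in a fixed order.
    max.write(out);
    country.write(out);
}

@Override
public void readFields(DataInput in) throws IOException {
    // Deserialize in exactly the same order as write().
    max.readFields(in);
    country.readFields(in);
}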
And you would change your code to:
Pair max = new Pair();
for (Pair p : values) {
    if (p.max.get() > max.max.get()) {
        max.set(p);  // copy the fields instead of aliasing p
    }
}
context.write(key, max);
I haven't created getters in Pair, so the code above accesses the public fields directly.
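As an aside, if you would rather not maintain a set() method, you can deep-copy each value with Hadoop's WritableUtils.clone instead. This is only a sketch, and it assumes your Pair has a no-arg constructor and complete write()/readFields() implementations; it also allocates a fresh object for every copy, so it trades some performance for convenience:
import org.apache.hadoop.io.WritableUtils;

Pair max = null;
for (Pair p : values) {
    if (max == null || p.max.get() > max.max.get()) {
        // clone() serializes p and deserializes it into a brand-new Pair,
        // so max no longer aliases the reused iterator object.
        max = WritableUtils.clone(p, context.getConfiguration());
    }
}
if (max != null) {
    context.write(key, max);
}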
Upvotes: 1