Davide Carneiro
Davide Carneiro

Reputation: 25

Strange behavior on Hadoop's reducer

I have a simple class called Pair that implements org.apache.hadoop.io.Writable. It contains two fields and is used as a Value in the MapReduce process.

For each key, I want to find the pair with the largest value of one of Pair's fields (preco). In the reducer the following code produces the expected result:

float max = 0;
String country = "";
for (Pair p : values){
    if (p.getPreco().get() > max)
    {
        max = p.getPreco().get();
        country = p.getPais().toString();
    }
}
context.write(key, new Pair(new FloatWritable(max), new Text(country)));

The following code, on the other hand, does not:

Pair max = new Pair();
for (Pair p : values)
    if (p.getPreco().get() > max.getPreco().get())
        max = p;

context.write(key, max);

The second code produces, for each key, the last value that is associated to it in the input file and not the highest value.

Is there a reason for this apparently strange behavior?

Upvotes: 1

Views: 51

Answers (1)

Binary Nerd
Binary Nerd

Reputation: 13927

You have this problem because the reducer is reusing objects, so its iterator over the values is always passing you the same object. Thus this code:

max = p;

Will always refer the current value of p. You need to copy the the data into max for this to work properly and not reference the object. This is why the first version of your code is working.

Usually in Hadoop I would implement a .set() method on a custom writable, this is a common pattern you will see. So your Pair class might look a bit like (its missing the interface methods etc):

public class Pair implements Writable {

    public FloatWritable max = new FloatWritable();
    public Text country = new Text();

    public void set(Pair p) {
        this.max.set(p.max.get());
        this.country.set(p.country);
    }
}

And you would change your code to:

Pair max = new Pair();
for (Pair p : values) {
    if (p.max().get() > max.max.get()) {
        max.set(p);
    }
}
context.write(key, max);

I haven't created getters in Pair so the code is changed slightly to directly access the public class variables.

Upvotes: 1

Related Questions