Hadoop MR hold array reference in reduce method

Question

I would like to keep an ArrayList that holds references to the value objects inside the reduce function.

@Override
public void reduce( final Text pKey,
                    final Iterable<BSONWritable> pValues,
                    final Context pContext )
        throws IOException, InterruptedException{
    final ArrayList<BSONWritable> bsonObjects = new ArrayList<BSONWritable>();

    for ( final BSONWritable value : pValues ){
        bsonObjects.add(value);
        //do some calculations.
    }
    for ( final BSONWritable value : bsonObjects ){
        //do something else.
    }
}

The problem is that bsonObjects.size() returns the correct number of elements, but every element of the list is equal to the last one inserted. For example, if the elements

{id:1}

{id:2}

{id:3}

are inserted, bsonObjects holds 3 items, but all of them are {id:3}. Is there a problem with this approach? Any idea why this happens? I tried changing the List to a Map, but then only one element was added to the map. I also tried making the declaration of bsonObjects global, but the same behavior happens.

Girish Rao · Accepted Answer

This is documented behavior. The reason is that the iterator behind pValues re-uses a single BSONWritable instance: Hadoop overwrites that instance's contents on each step, so when its value changes in the loop, every reference stored in the bsonObjects ArrayList reflects the change as well. Calling add() on bsonObjects stores a reference, not a copy. This reuse allows Hadoop to save memory.
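
To see the effect in isolation, here is a minimal plain-Java sketch with no Hadoop dependency. MutableDoc, ReusingIterable and ReuseDemo are hypothetical names made up for this illustration; they only imitate an iterator that hands back one instance and overwrites its contents, and are not Hadoop or mongo-hadoop classes.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a mutable value type such as BSONWritable.
class MutableDoc {
    int id;
    MutableDoc() {}
    MutableDoc(MutableDoc other) { this.id = other.id; } // copy constructor
}

// Hypothetical iterable that returns the SAME object on every step and
// only overwrites its contents, imitating the reducer's value iterator.
class ReusingIterable implements Iterable<MutableDoc> {
    private final int[] ids;
    private final MutableDoc shared = new MutableDoc();

    ReusingIterable(int... ids) { this.ids = ids; }

    @Override
    public Iterator<MutableDoc> iterator() {
        return new Iterator<MutableDoc>() {
            private int pos = 0;
            @Override public boolean hasNext() { return pos < ids.length; }
            @Override public MutableDoc next() {
                if (pos >= ids.length) throw new NoSuchElementException();
                shared.id = ids[pos++]; // overwrite, do not allocate
                return shared;
            }
        };
    }
}

class ReuseDemo {
    // Stores the references handed out by the iterator (the buggy pattern).
    static List<Integer> idsWithoutCopy(int... ids) {
        List<MutableDoc> kept = new ArrayList<>();
        for (MutableDoc d : new ReusingIterable(ids)) kept.add(d);
        return toIds(kept);
    }

    // Stores a fresh copy of each value (the fixed pattern).
    static List<Integer> idsWithCopy(int... ids) {
        List<MutableDoc> kept = new ArrayList<>();
        for (MutableDoc d : new ReusingIterable(ids)) kept.add(new MutableDoc(d));
        return toIds(kept);
    }

    private static List<Integer> toIds(List<MutableDoc> docs) {
        List<Integer> out = new ArrayList<>();
        for (MutableDoc d : docs) out.add(d.id);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(idsWithoutCopy(1, 2, 3)); // [3, 3, 3]
        System.out.println(idsWithCopy(1, 2, 3));    // [1, 2, 3]
    }
}
```

The list built without copying has the right size but three references to one object, which matches the symptom in the question.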

You should create a new BSONWritable in the first loop that is a deep copy of value, and then add that new object to bsonObjects.

Try this:

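for ( final BSONWritable value : pValues ){
    // A plain assignment (BSONWritable v = value;) would only copy the reference.
    // WritableUtils.clone (org.apache.hadoop.io.WritableUtils) makes a deep copy
    // by serializing the value into a new instance.
    BSONWritable v = WritableUtils.clone(value, pContext.getConfiguration());
    bsonObjects.add(v);
    //do some calculations.
}
for ( final BSONWritable value : bsonObjects ){
    //do something else.
}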

Then you will be able to iterate through bsonObjects in the second loop and retrieve each distinct value.

However, be careful: if you deep-copy every value, all the values for the key in this reducer will need to fit in memory.
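
If the second loop only needs a small part of each value, one way to ease that memory pressure is to copy just that part instead of the whole object. The sketch below is plain Java under that assumption; Doc, its score field and countAboveMean are hypothetical names invented for the example, and the shared Doc merely simulates the instance Hadoop re-uses.

```java
import java.util.ArrayList;
import java.util.List;

class CompactBufferDemo {
    // Hypothetical stand-in for a large value object with many fields.
    static class Doc {
        double score;
    }

    // Two-pass computation: the first pass computes the mean,
    // the second pass counts how many scores lie above it.
    static int countAboveMean(double... scores) {
        Doc shared = new Doc();               // simulates the re-used instance
        List<Double> kept = new ArrayList<>();
        double sum = 0;
        for (double s : scores) {
            shared.score = s;                 // contents overwritten on each step
            kept.add(shared.score);           // keep only the field that is needed
            sum += shared.score;
        }
        double mean = kept.isEmpty() ? 0 : sum / kept.size();
        int above = 0;
        for (double s : kept) {
            if (s > mean) above++;
        }
        return above;
    }
}
```

Here the buffer holds one number per value rather than one full object per value, so the per-key footprint grows much more slowly.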
