bukli

Reputation: 172

Hadoop makes only one output file even when I set numReduceTasks to 2

I have set up Hadoop on Ubuntu in pseudo-distributed mode. My understanding is that I can run a job with more than one reducer in pseudo-distributed mode. But even after setting numReduceTasks to 2, I get only one output file, part-r-00000. Why is that?
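Here is roughly what my driver does (simplified; the class name and the omitted mapper/reducer setup are placeholders, not my exact code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SecondarySortDriver {                       // placeholder name
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "secondary sort");
        job.setJarByClass(SecondarySortDriver.class);
        // mapper, reducer, composite key, grouping comparator etc. omitted
        job.setNumReduceTasks(2);                        // expecting two part-r-* files
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}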

Note: My input file has only 12 records. It is a secondary sort MR program.

Thanks for your help.

Upvotes: 1

Views: 2418

Answers (3)

Amar

Reputation: 12020

If you look at getPartition() in the default partitioner, HashPartitioner, it looks as follows:

public int getPartition(K key, V value, int numReduceTasks) { 
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; 
}

I think it is just happening by chance that all the records fall into the same partition. Run a quick test and see what values the above function returns for your 12 records.
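For example, a quick standalone check along these lines (not part of the MR job; the sample keys below are placeholders for your actual 12 keys, or for whatever your composite key's hashCode() is based on):

public class PartitionCheck {
    public static void main(String[] args) {
        int numReduceTasks = 2;
        // Replace these with the 12 keys your mapper actually emits.
        String[] sampleKeys = { "1994,32", "1994,45", "1995,12" };
        for (String key : sampleKeys) {
            int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
            System.out.println(key + " -> partition " + partition);
        }
    }
}

If every key maps to the same partition number, that explains the single non-trivial output file.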

Also, the number of reducers and the number of output files need not always be the same, as some reducers might produce no output records (even though they received input).

Also check out the Hadoop UI in your browser when you run this job; you should see the total number of reducers equal to the number you have set. Check your conf/mapred-site.xml, where you will find the URL for the Hadoop UI:

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>            

So, if you hit localhost:54311, you should see the list of running jobs.

Upvotes: 3

Tariq

Reputation: 34184

When you set the number of reducers through numReduceTasks, it is just a hint to the framework. It doesn't guarantee that you'll get exactly the specified number of reducers, as that actually depends on the number of partitions you get after the map phase, and the number of partitions determines the number of reducers you end up with. Partitioning happens based on keys, and the default partitioner is the hash partitioner: keys are hashed with a hash function and clubbed into groups. With data as small as yours, all the keys go to the same partition, as the framework tries its best to make the processing as efficient as possible, and creating multiple partitions for such a small amount of data would be overkill.
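If you want to see how key-based partitioning decides which reducer a record goes to, a custom Partitioner along these lines makes it explicit (a minimal sketch; the Text/IntWritable types and the first-character scheme are only illustrative, not something your job needs):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative sketch: route keys to reducers by the first character of the key,
// just to show that getPartition() decides which reducer a key lands on.
public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0; // map-only job, nothing to partition
        }
        return (key.toString().charAt(0) & Integer.MAX_VALUE) % numReduceTasks;
    }
}

You would register it in the driver with job.setPartitionerClass(FirstCharPartitioner.class).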

Upvotes: 0

smttsp

Reputation: 4191

I think it is because of the number of records. Hadoop uses 64 MB blocks by default, and since your data is less than one block, it is not divided into multiple blocks.

Upvotes: 1
