Ankit Khettry

Reputation: 1027

Why are my output files named 'part-r-xxxxx', even though I have not mentioned any reducer class?

I am using the Apache distribution of Hadoop 2.6.0. I am aware that the output files of mappers are named in the format 'part-m-xxxxx' for each mapper and those of reducers are named 'part-r-xxxxx' for each reducer. I was experimenting with a simple Max-Temperature use-case, and I have not set any reducer class in my Job configuration. This being the case, aren't the output files supposed to be named 'part-m-xxxxx'? Please find my Main class below:

public class MaxTemperature{

    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Max Temperature");
        job.setJarByClass(MaxTemperature.class);
        int noOfInputPaths = args.length-1;
        for (int i=0; i<noOfInputPaths; i++){
            System.out.println("Adding Input path: "+args[i]);
            FileInputFormat.addInputPath(job, new Path(args[i]));
        }
        System.out.println("Output path: "+args[args.length - 1]);
        FileOutputFormat.setOutputPath(job, new Path(args[args.length - 1]));

        job.setMapperClass(MaxTemperatureMapper.class);
        //job.setReducerClass(MaxTemperatureReducer.class);
        //job.setNumReduceTasks(3);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);     

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true)? 0 : 1);
    }
}

Upvotes: 1

Views: 1019

Answers (2)

Kishore

Reputation: 5891

If the MapReduce programmer does not set a reducer class with job.setReducerClass, the identity reducer is used by default. It simply passes each key/value pair through unchanged, but because the reduce phase still runs, the map output is shuffled and sorted before being written. An identity reducer is useful, for example, for embarrassingly parallel algorithms where the mappers do all the work but you still want the output key/value pairs sorted. In this case the output files are named part-r-xxxxx.

If you set

job.setNumReduceTasks(0);

then no reducer runs at all: the map output is written directly, without sorting, and the files are named part-m-xxxxx.
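A map-only variant of the driver from the question might look like the sketch below. The class names MaxTemperature and MaxTemperatureMapper come from the question; everything else assumes the standard org.apache.hadoop.mapreduce API (this is a sketch to show where setNumReduceTasks(0) fits, not a tested program):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Max Temperature (map-only)");
        job.setJarByClass(MaxTemperature.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MaxTemperatureMapper.class);
        // Map-only job: no shuffle, no sort, and the output files
        // are named part-m-xxxxx instead of part-r-xxxxx.
        job.setNumReduceTasks(0);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```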

Upvotes: 1

java_bee

Reputation: 453

The default Hadoop OutputFormat is being used, and it initializes and creates the files named part-r-xxxxx that you are seeing under the output folder.

If the created file(s) are empty, it is because nothing is written (no context.write(...)) in the reducer part. But that does not stop them from being created during initialization.

To stop this, you need to set an output format that says you are not expecting any output. Refer below.

job.setOutputFormatClass(NullOutputFormat.class);

With the above set, your part files are never initialized at all.

Note: alternatively, you can use LazyOutputFormat, which ensures that an output file is only created when there is data to write, so empty files are never initialized. See below.

LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
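For context, both calls belong in the driver's main method, next to the other job.set... calls from the question. A fragment (not a complete program; the import paths assume the new org.apache.hadoop.mapreduce.lib.output package):

```java
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Option 1: suppress output entirely -- no part files are created.
job.setOutputFormatClass(NullOutputFormat.class);

// Option 2: create part files lazily -- a file appears only once
// the first record is written to it, so empty files are skipped.
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
```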

Hope this helps.

Upvotes: 1
