Reputation: 43
I'm using Hadoop 0.20.203.0. I want to output to two different files, so I'm trying to get MultipleOutputs working.
Here's my configuration method:
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: indycascade <in> <out>");
System.exit(2);
}
Job job = new Job(conf, "indy cascade");
job.setJarByClass(IndyCascade.class);
job.setMapperClass(ICMapper.class);
job.setCombinerClass(ICReducer.class);
job.setReducerClass(ICReducer.class);
TextInputFormat.addInputPath(job, new Path(otherArgs[0]));
TextOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
MultipleOutputs.addNamedOutput(conf, "sql", TextOutputFormat.class, LongWritable.class, Text.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
However, this won't compile. The offending line is MultipleOutputs.addNamedOutput(...), which fails with a "cannot find symbol" error.
isaac/me/saac/i/IndyCascade.java:94: cannot find symbol
symbol : method addNamedOutput(org.apache.hadoop.conf.Configuration,java.lang.String,java.lang.Class<org.apache.hadoop.mapreduce.lib.output.TextOutputFormat>,java.lang.Class<org.apache.hadoop.io.LongWritable>,java.lang.Class<org.apache.hadoop.io.Text>)
location: class org.apache.hadoop.mapred.lib.MultipleOutputs
MultipleOutputs.addNamedOutput(conf, "sql", TextOutputFormat.class, LongWritable.class, Text.class);
Of course, I tried using a JobConf instead of Configuration, as the API demands, but that leads to the same error. Additionally, JobConf is deprecated.
How do I get MultipleOutputs to work? Is that even the correct class to use?
Upvotes: 0
Views: 2958
Reputation: 30089
You're mixing old and new API types:
You're using the old API class org.apache.hadoop.mapred.lib.MultipleOutputs:
location: class org.apache.hadoop.mapred.lib.MultipleOutputs
with the new API class org.apache.hadoop.mapreduce.lib.output.TextOutputFormat:
symbol : method addNamedOutput(org.apache.hadoop.conf.Configuration,java.lang.String,java.lang.Class<org.apache.hadoop.mapreduce.lib.output.TextOutputFormat>,java.lang.Class<org.apache.hadoop.io.LongWritable>,java.lang.Class<org.apache.hadoop.io.Text>)
Make the APIs consistent and you should be OK.
Edit: In fact, 0.20.203 doesn't have a port of MultipleOutputs for the new API, so you'll have to use the old API, find a new API port online (Cloudera 0.20.2+320), or port it yourself.
Also, you should look at the ToolRunner class to execute your jobs. It removes the need to call GenericOptionsParser explicitly:
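import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Driver extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        // ToolRunner parses the generic Hadoop options before calling run()
        System.exit(ToolRunner.run(new Driver(), args));
    }
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: indycascade <in> <out>");
            return 2;
        }
        Job job = new Job(getConf());
        // From here on, use the job's own copy of the configuration
        Configuration conf = job.getConfiguration();
        // insert other job set up here
        return job.waitForCompletion(true) ? 0 : 1;
    }
}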
Final point: any reference to conf after you create the Job instance refers to the original conf. Job makes a deep copy of the conf object, so calling MultipleOutputs.addNamedOutput(conf, ...) will not have the desired effect; use MultipleOutputs.addNamedOutput(job.getConfiguration(), ...) instead. See my example code above for the correct way to do this.
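To make the copy behaviour concrete, here is a minimal sketch that runs without Hadoop on the classpath. StubConfiguration and StubJob are hypothetical stand-ins I made up for illustration, not Hadoop classes, and the "named.output" key is invented too. The only assumption carried over from the real API is the one stated above: the job takes its own copy of the configuration when it is constructed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Hadoop's Configuration: just a string map.
class StubConfiguration {
    private final Map<String, String> props = new HashMap<>();

    StubConfiguration() {}

    // Copy constructor: the new object gets its own independent map.
    StubConfiguration(StubConfiguration other) {
        props.putAll(other.props);
    }

    void set(String key, String value) { props.put(key, value); }

    String get(String key) { return props.get(key); }
}

// Hypothetical stand-in for Job: it copies the configuration it is given.
class StubJob {
    private final StubConfiguration conf;

    StubJob(StubConfiguration conf) {
        this.conf = new StubConfiguration(conf); // private copy taken here
    }

    StubConfiguration getConfiguration() { return conf; }
}

class ConfCopyDemo {
    public static void main(String[] args) {
        StubConfiguration conf = new StubConfiguration();
        StubJob job = new StubJob(conf);

        // Too late: this changes the caller's object, not the job's copy.
        conf.set("named.output", "sql");
        System.out.println(job.getConfiguration().get("named.output")); // prints null

        // Setting it on the job's own configuration is what the job sees.
        job.getConfiguration().set("named.output", "sql");
        System.out.println(job.getConfiguration().get("named.output")); // prints sql
    }
}
```

The same ordering rule applies to the real driver: either finish all changes to conf before constructing the Job, or make them through job.getConfiguration() afterwards.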
Upvotes: 4