Xion345

Reputation: 1616

What's the best way to have multiple outputs for a job using the stable version of Hadoop?

I have a MapReduce job whose role is to split my input file into two files according to a given criterion. I am currently using Hadoop r0.20.203 because it is the current stable version.
This version offers two APIs: the old one (org.apache.hadoop.mapred) and the new one (org.apache.hadoop.mapreduce).

As you can imagine, I am using the new API, and my problem is that Hadoop r0.20.203 does not offer any MultipleOutput formats in the new API.
Hadoop 0.20.203 still offers MultipleTextOutputFormat and MultipleOutputs (both of which are suitable for my case) in the old API. Moreover, the newer Hadoop 0.22 offers MultipleOutputs in the new API.

I see four solutions to my problem:

What would you do if you were me?

Upvotes: 0

Views: 1790

Answers (2)

Chris Shain

Reputation: 51369

Because so much code relies on it, and because the new API (as you have discovered) was never fully implemented, they are probably un-deprecating the old API in a future version of Hadoop. I'd use the old API and not worry about it.

See http://www.cloudera.com/blog/2010/08/what%E2%80%99s-new-in-apache-hadoop-0-21/

Upvotes: 1

Thomas Jungblut

Reputation: 20969

Why don't you put the source code in your project and use it?

http://grepcode.com/file_/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.java/?v=source

It should be compatible with r0.20.203. I don't see it using any classes that would be unavailable in the older version.

And there is really nothing magic about it: it just sets up a record writer for each configured output (with its types and so on). I bet you could have written your own in the time it took to formulate the question.
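To illustrate the "one record writer per configured output" idea, here is a minimal plain-Java sketch. It is not the Hadoop MultipleOutputs source and uses no Hadoop classes; the names NamedOutputs, writerFactory and NamedOutputsDemo are made up for this example, and a real job would presumably create Hadoop RecordWriter instances from the task context instead of PrintWriters.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical helper (not a Hadoop class): keeps one writer per named
 * output and creates each writer lazily on first use.
 */
class NamedOutputs implements AutoCloseable {
    // Assumption: the caller decides how a writer for a given name is opened.
    private final Function<String, PrintWriter> writerFactory;
    private final Map<String, PrintWriter> writers = new LinkedHashMap<>();

    NamedOutputs(Function<String, PrintWriter> writerFactory) {
        this.writerFactory = writerFactory;
    }

    /** Writes one tab-separated record to the output with the given name. */
    void write(String outputName, String key, String value) {
        PrintWriter writer = writers.computeIfAbsent(outputName, writerFactory);
        writer.print(key);
        writer.print('\t');
        writer.print(value);
        writer.print('\n');
    }

    /** Closes every writer that was opened. */
    @Override
    public void close() {
        for (PrintWriter writer : writers.values()) {
            writer.close();
        }
        writers.clear();
    }
}

class NamedOutputsDemo {
    public static void main(String[] args) {
        // In-memory sinks stand in for the two output files.
        Map<String, StringWriter> sinks = new LinkedHashMap<>();
        try (NamedOutputs outputs = new NamedOutputs(name -> {
            StringWriter sink = new StringWriter();
            sinks.put(name, sink);
            return new PrintWriter(sink);
        })) {
            String[] records = {"1,keep", "2,drop", "3,keep"};
            for (String record : records) {
                String[] parts = record.split(",");
                // The "given criterion" from the question, here an arbitrary flag.
                String target = parts[1].equals("keep") ? "kept" : "dropped";
                outputs.write(target, parts[0], parts[1]);
            }
        }
        sinks.forEach((name, sink) -> System.out.print(name + ":\n" + sink));
    }
}
```

In a reducer you would open the helper during setup, call write(...) for each record, and close it during cleanup, which as far as I can tell is roughly the lifecycle MultipleOutputs follows as well.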

Upvotes: 0
