Reputation: 1533
I want to write a Java program that reads input from HDFS, processes it using MapReduce, and writes the output into MongoDB.
Here is the scenario:
Reading from HDFS and processing the data with MapReduce are simple, but I am stuck on writing the result into MongoDB. Is there a Java API for writing the result into MongoDB? Another question: since this is a Hadoop cluster, we don't know which datanode will run the Reducer task and generate the result. Is it possible to write the result into a MongoDB instance that is installed on a specific server?
If I want to write the result into HDFS, the code will be like this:
public void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
    // Sum all values that share this key.
    long sum = 0;
    for (LongWritable value : values) {
        sum += value.get();
    }
    context.write(new Text(key), new LongWritable(sum));
}
Now I want to write the result into MongoDB instead of HDFS. How can I do that?
Upvotes: 1
Views: 2950
Reputation: 1130
You want the MongoDB Connector for Hadoop. See the examples that come with it.
It's tempting to just add code in your Reducer that, as a side effect, inserts data into your database. Avoid this temptation. One reason to use a connector as opposed to just inserting data as a side effect of your reducer class is speculative execution: Hadoop can sometimes run two of the exact same reduce tasks in parallel, which can lead to extraneous inserts and duplicate data.
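To make the duplicate-data risk concrete, here is a toy simulation in plain Java, with no Hadoop or MongoDB involved. The class and method names are made up for illustration. It models two attempts of the same reduce task: blind inserts leave two records, while writes keyed by the reduce key leave one. It is only a sketch of the problem, not a description of how the connector works internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of speculative execution (hypothetical names, no real cluster). */
final class SpeculativeWriteDemo {

    /** Sums the values for one key, as the reducer in the question does. */
    static long reduce(List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    /** Side-effect style: every task attempt appends a new record. */
    static List<String> runWithBlindInserts(String key, List<Long> values, int attempts) {
        List<String> collection = new ArrayList<>();
        for (int i = 0; i < attempts; i++) {
            collection.add(key + "=" + reduce(values));
        }
        return collection;
    }

    /** Keyed style: every attempt writes to the same slot, so a rerun overwrites. */
    static Map<String, Long> runWithKeyedWrites(String key, List<Long> values, int attempts) {
        Map<String, Long> collection = new HashMap<>();
        for (int i = 0; i < attempts; i++) {
            collection.put(key, reduce(values));
        }
        return collection;
    }

    public static void main(String[] args) {
        List<Long> values = Arrays.asList(1L, 2L, 3L);
        // Two attempts of the same reduce task, as under speculative execution.
        System.out.println(runWithBlindInserts("alice", values, 2)); // [alice=6, alice=6]
        System.out.println(runWithKeyedWrites("alice", values, 2));  // {alice=6}
    }
}
```

The second variant only stays correct because the result for a key is deterministic, so a repeated write stores the same value.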
Upvotes: 2
Reputation: 931
I spent my morning implementing the same scenario. Here is my solution:
Create three classes: a mapper, a reducer, and a driver. The driver class looks like this:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import com.mongodb.hadoop.mapred.MongoOutputFormat;
public class Experiment extends Configured implements Tool {
    public int run(final String[] args) throws Exception {
        final Configuration conf = getConf();
        // args[1] is the MongoDB URI that receives the job output.
        conf.set("mongo.output.uri", args[1]);
        final JobConf job = new JobConf(conf, Experiment.class);
        // args[0] is the HDFS input path.
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Set your mapper, reducer, and output key/value classes here (not shown).
        // Send the reducer output to MongoDB instead of HDFS.
        job.setOutputFormat(MongoOutputFormat.class);
        JobClient.runJob(job);
        return 0;
    }
    public static void main(final String[] args) throws Exception {
        int res = ToolRunner.run(new Experiment(), args);
        System.exit(res);
    }
}
When you run the Experiment class on your cluster, pass two parameters. The first is the HDFS location of your input, and the second is the MongoDB URI that will hold your results. Here is an example call, assuming that your class is in the package org.example:
sudo -u hdfs hadoop jar ~/jar/myexample.jar org.example.Experiment myfilesinhdfs/* mongodb://
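Because the driver reads both values positionally from args, a small check before configuring the job can catch a missing or swapped argument early. This is a minimal sketch in plain Java. The class name ExperimentArgs is hypothetical, and it assumes the output URI always uses the mongodb:// scheme.

```java
/** Hypothetical helper that validates the driver's two command-line arguments. */
final class ExperimentArgs {
    final String inputPath; // HDFS input location (first argument)
    final String mongoUri;  // MongoDB output URI (second argument)

    private ExperimentArgs(String inputPath, String mongoUri) {
        this.inputPath = inputPath;
        this.mongoUri = mongoUri;
    }

    /** Returns the parsed arguments or throws if they do not look right. */
    static ExperimentArgs parse(String[] args) {
        if (args == null || args.length != 2) {
            throw new IllegalArgumentException(
                    "usage: Experiment <hdfs-input-path> <mongodb-uri>");
        }
        // Assumption: the output URI uses the mongodb:// scheme.
        if (!args[1].startsWith("mongodb://")) {
            throw new IllegalArgumentException(
                    "second argument must be a mongodb:// URI, got: " + args[1]);
        }
        return new ExperimentArgs(args[0], args[1]);
    }
}
```

In run, you could call ExperimentArgs.parse(args) first and pass its mongoUri field to conf.set("mongo.output.uri", ...).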
This might not be the best way but it does the job for me.
Upvotes: 0
Reputation: 1490
Yes. You write to MongoDB as usual. The fact that your MongoDB deployment runs on shards is a detail that is hidden from you.
Upvotes: 0