Reputation: 29
I tried to write a simple word-count program in MapReduce. My MapReduce program writes its output only to files, but I don't want the output written to files. I want to collect that output (like a Java collection) and use it in the rest of my program.
For example, when I submit a query to Hive it returns a ResultSet object, even though internally the query is converted into a MapReduce program; once the job finishes it hands back the ResultSet instead of writing results to the file system like other MapReduce programs do.
So how can I collect that output, or how can I build my own object in the mapper or reducer and collect that object elsewhere in my Java program? I don't want the output written to files.
Upvotes: 3
Views: 2558
Reputation: 364
As I understand your question, you are using Hive to run MapReduce over HDFS data and you want to work with the Hive output afterwards without saving it to HDFS. You can write the output to HDFS or to the local filesystem with the following commands in Hive:
The following command writes the table to a local directory:
INSERT OVERWRITE LOCAL DIRECTORY '' SELECT * FROM table_name;
The following command writes the table to an HDFS directory:
INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM table_name;
Later, if you want to combine this output with the output of another Java MapReduce job in HDFS, first write the Hive output to HDFS as above, then use one of the solutions below to work with the two outputs.
Solution 1: Use map-side or reduce-side joins in Java.
[OR]
Solution 2: A side-by-side technique using the JobConfig object or the Hadoop DistributedCache (see the sketch below).
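For Solution 2, here is a minimal sketch of the DistributedCache variant, assuming the Hive output landed in an HDFS file such as /tmp/hdfs_out/000000_0 and is Ctrl-A delimited (both assumptions); the class and field names are hypothetical:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapSideJoin {
    public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<String, String> hiveSide = new HashMap<String, String>();

        @Override
        protected void setup(Context context) throws IOException {
            // Files registered with the DistributedCache are copied to every task node.
            Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\u0001"); // Hive's default field delimiter
                hiveSide.put(parts[0], parts[1]);
            }
            reader.close();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            String match = hiveSide.get(parts[0]);
            if (match != null) {                       // emit only keys present in both datasets
                context.write(new Text(parts[0]), new Text(parts[1] + "\t" + match));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "map-side-join");
        job.setJarByClass(MapSideJoin.class);
        job.setMapperClass(JoinMapper.class);
        job.setNumReduceTasks(0);                      // map-only join
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Hypothetical path of the Hive output written by INSERT OVERWRITE DIRECTORY above
        DistributedCache.addCacheFile(new URI("/tmp/hdfs_out/000000_0"), job.getConfiguration());
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}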
Upvotes: 0
Reputation: 6119
MapReduce tasks generally take their input from HDFS or HBase.
First take the absolute path of the directory inside the HDFS filesystem.
Now, in your MapReduce job's main method (the driver), use setOutputFormatClass() of the Job class to set the output format.
A sample for text output is:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
Configuration conf = new Configuration();
Job job = new Job(conf, "app");
job.setJarByClass(Application.class); // batch/main method's class name
job.setMapperClass(Mapper.class);     // replace with your own Mapper implementation
job.setReducerClass(Reducer.class);   // replace with your own Reducer implementation
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
Now, when you run the Hadoop job, the second argument (args[1]) is the output path, which is a subdirectory of HDFS.
Since the output is in HDFS you cannot read it with normal Unix commands directly; either print it with the HDFS shell or copy it to the local filesystem first and then open it with nano/vi:
hdfs dfs -cat {path_to_outfile_inHDFS}
hdfs dfs -copyToLocal {path_to_outfile_inHDFS} {local_path}
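If you want the results back inside the Java program itself (which is what the question asks for), one minimal sketch is to read the reducer's part file through the HDFS FileSystem API after the job finishes; the part file name part-r-00000 and the tab-separated parsing are assumptions based on the defaults of a single-reducer TextOutputFormat job:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// After job.waitForCompletion(true) returns, read the output back into a Java collection.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path outFile = new Path(args[1] + "/part-r-00000"); // default name of the single reducer's output
Map<String, Integer> counts = new HashMap<String, Integer>();
BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(outFile)));
String line;
while ((line = reader.readLine()) != null) {
    String[] parts = line.split("\t");              // TextOutputFormat writes key<TAB>value
    counts.put(parts[0], Integer.parseInt(parts[1]));
}
reader.close();
// 'counts' can now be used like any other Java collection in the rest of the program.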
Upvotes: 0
Reputation: 176
MapReduce jobs tend to consume and produce large amounts of data. They also tend to be stand-alone applications rather than part of some larger workflow. Neither of those seems to hold true in this case. You can set the output format to NullOutputFormat to prevent any files from being created. Then you can add the results to your job conf as a String, which makes them available to anything that can read the conf.
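A minimal sketch of the NullOutputFormat part, assuming a word-count job; instead of stuffing the result into the conf, this sketch reports it through a job counter, which is a standard channel a task can use to hand a small value back to the driver (the WordStats counter and the class names are made up for illustration):
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class NoFileOutput {
    // Hypothetical counter used to carry a small aggregate back to the driver.
    enum WordStats { TOTAL_WORDS }

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                context.write(new Text(tokens.nextToken()), ONE);
            }
        }
    }

    public static class CountingReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            // Report through a counter instead of writing key/value pairs to a file.
            context.getCounter(WordStats.TOTAL_WORDS).increment(sum);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "no-file-output");
        job.setJarByClass(NoFileOutput.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(CountingReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(NullOutputFormat.class); // no part files are created
        FileInputFormat.addInputPath(job, new Path(args[0]));
        if (job.waitForCompletion(true)) {
            long total = job.getCounters().findCounter(WordStats.TOTAL_WORDS).getValue();
            System.out.println("Total words: " + total); // result available in the driver JVM
        }
    }
}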
Upvotes: 0
Reputation: 12443
There are many ways to handle the output of the Hadoop M-R framework. The primary interface for a user to describe an M-R job is the JobConf class. There you will find the
getOutputFormat()
and
setOutputFormat()
methods, where you can describe a different kind of result collection, such as DB (HBase) storage. The thing to remember is that M-R jobs process large volumes of data, which would be cumbersome to manage in Java memory as objects unless you had a well-developed distributed object architecture.
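A minimal sketch of what setOutputFormat() looks like on the old mapred API's JobConf, assuming hypothetical MyDriver/MyMapper/MyReducer classes; SequenceFileOutputFormat is only a stand-in here, and an HBase TableOutputFormat could be plugged in the same way:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

JobConf conf = new JobConf(MyDriver.class);   // hypothetical driver class
conf.setJobName("custom-output");
conf.setMapperClass(MyMapper.class);          // hypothetical mapper (old mapred API)
conf.setReducerClass(MyReducer.class);        // hypothetical reducer (old mapred API)
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
// Swap the output format here; e.g. HBase's TableOutputFormat would send results to a table
// instead of files.
conf.setOutputFormat(SequenceFileOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);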
Alternatively, you could provide your actual requirement.
Hope this helps, Pat
Upvotes: 2