batman

Reputation: 269

How to use a MapReduce output in Distributed Cache

Let's say I have a MapReduce job that creates an output file part-00000, and another job runs after this job completes.

How can I use the output file of the first job in the distributed cache for the second job?

Upvotes: 0

Views: 270

Answers (1)

suresiva

Reputation: 3173

The steps below might help you:

  • Pass the first job's output directory path to the second job's driver class (a full driver sketch that combines these steps is shown after the list).

  • Use a PathFilter to list the files starting with part-*. Refer to the code snippet below for your second job's driver class:

        // List only the part-* files under the first job's output directory
        FileSystem fs = FileSystem.get(conf);
        FileStatus[] fileList = fs.listStatus(new Path("1st job o/p path"),
                new PathFilter() {
                    @Override
                    public boolean accept(Path path) {
                        return path.getName().startsWith("part-");
                    }
                });
    
  • Iterate over every part-* file and add it to the distributed cache.

        // Add each part-* file's URI to the second job's distributed cache
        for (int i = 0; i < fileList.length; i++) {
            DistributedCache.addCacheFile(fileList[i].getPath().toUri(), conf);
        }
    

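Putting the steps together, a rough driver sketch could look like the one below; the class name, the job names and the argument positions are just placeholders here, and the mapper/reducer setup for both jobs is omitted:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ChainedJobsDriver {

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path firstJobOutput = new Path(args[1]);   // first job's output directory

            // First job writes its part-* files to firstJobOutput
            Job firstJob = Job.getInstance(conf, "first-job");
            // ... set mapper/reducer and key/value classes for the first job ...
            FileInputFormat.addInputPath(firstJob, new Path(args[0]));
            FileOutputFormat.setOutputPath(firstJob, firstJobOutput);
            if (!firstJob.waitForCompletion(true)) {
                System.exit(1);
            }

            // Second job: list the part-* files of the first job's output
            // and register each of them in the distributed cache
            Job secondJob = Job.getInstance(conf, "second-job");
            FileSystem fs = FileSystem.get(conf);
            FileStatus[] fileList = fs.listStatus(firstJobOutput, new PathFilter() {
                @Override
                public boolean accept(Path path) {
                    return path.getName().startsWith("part-");
                }
            });
            for (FileStatus status : fileList) {
                DistributedCache.addCacheFile(status.getPath().toUri(),
                        secondJob.getConfiguration());
            }
            // ... configure input/output and submit the second job as usual ...
        }
    }

In the second job's mapper, you can then read these cached files back in setup(), for example via DistributedCache.getLocalCacheFiles(context.getConfiguration()).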
Upvotes: 4
