batman

Reputation: 269

How to use a MapReduce output in Distributed Cache

Let's say I have a MapReduce job that creates an output file part-00000, and another job runs after this job completes.

How can I use the output file of the first job in the distributed cache for the second job?

Upvotes: 0

Views: 270

Answers (1)

suresiva

Reputation: 3173

The steps below might help you:

  • Pass the first job's output directory path to the second job's driver class (a full driver sketch that combines these steps is shown after the list).

  • Use a PathFilter to list the files starting with part-*. Refer to the code snippet below for your second job's driver class:

        // List only the part-* files under the first job's output directory
        FileSystem fs = FileSystem.get(conf);
        FileStatus[] fileList = fs.listStatus(new Path("1st job o/p path"),
                new PathFilter() {
                    @Override
                    public boolean accept(Path path) {
                        return path.getName().startsWith("part-");
                    }
                });
    
  • Iterate over every part-* file and add it to the distributed cache.

        // Add each part-* file's URI to the second job's distributed cache
        for (int i = 0; i < fileList.length; i++) {
            DistributedCache.addCacheFile(fileList[i].getPath().toUri(), conf);
        }
    

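Putting the steps together, a rough driver sketch could look like the one below; the class name, the job names and the argument positions are just placeholders here, and the mapper/reducer setup for both jobs is omitted:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ChainedJobsDriver {

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path firstJobOutput = new Path(args[1]);   // first job's output directory

            // First job writes its part-* files to firstJobOutput
            Job firstJob = Job.getInstance(conf, "first-job");
            // ... set mapper/reducer and key/value classes for the first job ...
            FileInputFormat.addInputPath(firstJob, new Path(args[0]));
            FileOutputFormat.setOutputPath(firstJob, firstJobOutput);
            if (!firstJob.waitForCompletion(true)) {
                System.exit(1);
            }

            // Second job: list the part-* files of the first job's output
            // and register each of them in the distributed cache
            Job secondJob = Job.getInstance(conf, "second-job");
            FileSystem fs = FileSystem.get(conf);
            FileStatus[] fileList = fs.listStatus(firstJobOutput, new PathFilter() {
                @Override
                public boolean accept(Path path) {
                    return path.getName().startsWith("part-");
                }
            });
            for (FileStatus status : fileList) {
                DistributedCache.addCacheFile(status.getPath().toUri(),
                        secondJob.getConfiguration());
            }
            // ... configure input/output and submit the second job as usual ...
        }
    }

In the second job's mapper, you can then read these cached files back in setup(), for example via DistributedCache.getLocalCacheFiles(context.getConfiguration()).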
Upvotes: 4
