Reputation: 269
Let's say I have a MapReduce job that creates an output file, part-00000,
and a second job that runs after the first one completes.
How can I use the output file of the first job in the distributed cache for the second job?
Upvotes: 0
Views: 270
Reputation: 3173
The steps below might help:
Pass the first job's output directory path to the second job's driver class.
Use a PathFilter to list the files whose names start with part-.
For example, in the second job's driver class:
// List only the part-* files in the first job's output directory
FileSystem fs = FileSystem.get(conf);
FileStatus[] fileList = fs.listStatus(new Path("1st job o/p path"),
    new PathFilter() {
        @Override
        public boolean accept(Path path) {
            return path.getName().startsWith("part-");
        }
    });
Iterate over every part-* file and add it to the distributed cache.
for (int i = 0; i < fileList.length; i++) {
    // Path.toUri() already returns a java.net.URI; addCacheFile also takes the job Configuration
    DistributedCache.addCacheFile(fileList[i].getPath().toUri(), conf);
}
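Once the files are in the cache, the second job still has to read them. Below is a minimal sketch of that side, assuming the first job wrote its output with the default TextOutputFormat (one "key<TAB>value" pair per line). The class name PartFileLookup and the method load are hypothetical names made up for illustration, and the sketch uses only the JDK so it can be tried without a cluster.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of Hadoop): loads one localized part-* file
// into an in-memory map that a mapper or reducer can consult.
final class PartFileLookup {

    // Assumption: every useful line is "key<TAB>value", as TextOutputFormat
    // writes by default. Lines without a tab are skipped.
    public static Map<String, String> load(Path localPartFile) throws IOException {
        Map<String, String> lookup = new HashMap<>();
        try (BufferedReader reader =
                 Files.newBufferedReader(localPartFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // no key/value separator on this line
                }
                lookup.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
        return lookup;
    }
}
```

In the second job's mapper, this would typically be called from setup(): DistributedCache.getLocalCacheFiles(context.getConfiguration()) returns the local copies as Hadoop Path objects, and each one can probably be handed to the helper as java.nio.file.Paths.get(p.toString()). Note that the Path in the sketch is java.nio.file.Path, not org.apache.hadoop.fs.Path.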
Upvotes: 4