jojoba

Reputation: 554

Hadoop cache file for all map tasks

My map function has to read a file for every input. That file never changes; it is only read. I think the DistributedCache could help me a lot, but I can't find a way to use it. The `public void configure(JobConf conf)` method that I need to override is, I think, deprecated; `JobConf` is deprecated for sure. All the DistributedCache tutorials use the deprecated way too. What can I do? Is there another configure method that I can override?

These are the very first lines of my map function:

     Configuration conf = new Configuration();          // load the MFile
     FileSystem fs = FileSystem.get(conf);
     Path inFile = new Path("planet/MFile");
     FSDataInputStream in = fs.open(inFile);
     DecisionTree dtree = new DecisionTree().loadTree(in);

I want to cache that MFile so that my map function doesn't need to open it over and over again.

Upvotes: 4

Views: 4221

Answers (2)

jojoba

Reputation: 554

Well, I did it, I think. I followed Ravi Bhatt's tips and wrote this:

  @Override
  protected void setup(Context context) throws IOException, InterruptedException
  {
      FileSystem fs = FileSystem.get(context.getConfiguration());
      URI[] files = DistributedCache.getCacheFiles(context.getConfiguration());
      Path path = new Path(files[0].toString());
      in = fs.open(path);
      dtree = new DecisionTree().loadTree(in);
  }

Inside my main method I do this to add it to the cache:

  DistributedCache.addCacheFile(new URI(args[0]+"/"+"MFile"), conf);
  Job job = new Job(conf, "MR phase one");

I am able to retrieve the file I need this way, but I can't tell yet whether it works 100%. Is there any way to test it? Thanks.
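One hedged way to check that the cache file really is loaded by every task (a sketch, not part of the original answer; the class name, counter group/name, and the `TreeMapper` type parameters are all illustrative): increment a user-defined counter in `setup` after the file opens, then compare the counter in the job's final output against the number of map tasks.

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Mapper;

public class TreeMapper extends Mapper<Object, Object, Object, Object> {

    private FSDataInputStream in;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        URI[] files = DistributedCache.getCacheFiles(context.getConfiguration());
        Path path = new Path(files[0].toString());
        in = fs.open(path);
        // ...load your model here, e.g. new DecisionTree().loadTree(in)...

        // Bump a made-up user counter; after the job finishes, the
        // "CacheCheck / MFILE_LOADED" counter printed with the job's
        // counters should equal the number of map tasks.
        context.getCounter("CacheCheck", "MFILE_LOADED").increment(1);
    }
}
```

If the counter comes out lower than the task count, some task failed to open the cached file before throwing.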

Upvotes: 5

Ravi Bhatt

Reputation: 3163

`JobConf` was deprecated in 0.20.x, but in 1.0.0 it is not! :-) (as of writing this)

To your question: there are two ways to write MapReduce jobs in Java. One is by extending the classes in the org.apache.hadoop.mapreduce package, and the other is by implementing the interfaces in the org.apache.hadoop.mapred package (or the other way round).

Not sure which one you are using; if you don't have a configure method to override, you will have a setup method to override instead.

@Override
protected void setup(Context context) throws IOException, InterruptedException

This is similar to configure and should help you.

You get a setup method to override when you extend Mapper class in org.apache.hadoop.mapreduce package.
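To make the shape of the new API concrete, here is a minimal skeleton (the class name, key/value types, and word-splitting body are illustrative, not taken from the question): `setup` runs once per map task before any record is processed, which is where per-task initialization like opening a cached side file belongs.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Extending org.apache.hadoop.mapreduce.Mapper (the new API) gives you
// setup(), the replacement for the old API's configure(JobConf).
public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Runs once per task, before any map() call: open cached files,
        // load models, read job configuration here.
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Runs once per input record and can rely on anything setup() prepared.
        for (String token : value.toString().split("\\s+")) {
            context.write(new Text(token), new IntWritable(1));
        }
    }
}
```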

Upvotes: 1
