Benedikt Bünz

Reputation: 646

Best practice to pass copy of object to all mappers in hadoop

Hello, I am currently learning MapReduce and am trying to build a small job with Hadoop 1.0.4. I have a list of stop words and a list of patterns. Before my files are mapped, I want to load the stop words into an efficient data structure such as a map, and I also want to build one regex pattern from my pattern list. Since these are serial tasks, I want to do them before the mapping and pass every mapper a copy of those two objects, which they can then read from and write to.

I thought about simply having a static variable with a getter in my driver class, but since Java passes objects by reference, this doesn't work out. I could of course clone the objects before passing them, but that really does not seem like good practice. I read something about the distributed cache, but as far as I understood it, it is only for files and not for objects, and in that case I could just let every mapper read the stop word/pattern files itself.
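To make it concrete, this is roughly the serial setup step I have in mind (just a sketch; the file names and the method name are placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// One-time setup I'd like to run before the mappers start.
// "stopwords.txt" and "patterns.txt" are placeholder file names.
void loadResources() throws IOException {
    Set<String> stopwords = new HashSet<String>();
    BufferedReader sw = new BufferedReader(new FileReader("stopwords.txt"));
    String line;
    while ((line = sw.readLine()) != null) {
        stopwords.add(line.trim());
    }
    sw.close();

    // combine the pattern list into a single alternation regex
    StringBuilder alternation = new StringBuilder();
    BufferedReader pr = new BufferedReader(new FileReader("patterns.txt"));
    while ((line = pr.readLine()) != null) {
        if (alternation.length() > 0) {
            alternation.append('|');
        }
        alternation.append(line.trim());
    }
    pr.close();
    Pattern combined = Pattern.compile(alternation.toString());
}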

Thanks for any help!

Upvotes: 2

Views: 3966

Answers (2)

David Gruzman

Reputation: 8088

The Hadoop distributed cache is a mechanism specifically for passing reference data to mappers. From a performance viewpoint it is better than loading from HDFS, since the data is copied from HDFS to the local file system once per node rather than once per task.
You are completely right: it is only for files, and reading the files and converting them into your data structures is your responsibility.
To the best of my understanding, Hadoop does not support passing objects directly. However, if you use some kind of serialization in these files, it will be close to what you are asking for.
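If it helps, a minimal sketch of that approach with the Hadoop 1.x DistributedCache API (the HDFS path and the stopwords field are placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

// Driver: register the file before submitting the job.
// Note: new URI(...) throws URISyntaxException; handle or declare it.
DistributedCache.addCacheFile(new URI("/cache/stopwords.txt"), job.getConfiguration());

// Mapper: read the node-local copy once per task, in setup().
private Set<String> stopwords = new HashSet<String>();

@Override
protected void setup(Context context) throws IOException {
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    if (cached != null && cached.length > 0) {
        BufferedReader br = new BufferedReader(new FileReader(cached[0].toString()));
        String line;
        while ((line = br.readLine()) != null) {
            stopwords.add(line.trim());
        }
        br.close();
    }
}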

Upvotes: 2

Lorand Bendig

Reputation: 10650

A possible solution is to copy stopwords.txt to HDFS before running the job, and then read it into an appropriate data structure in the Mapper's setup method, e.g.:

MyMapper class:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

...
private Map<String, Object> stopwords = null;

@Override
public void setup(Context context) {
    Configuration conf = context.getConfiguration();
    // hardcode the path or set it in the job runner class and retrieve it via this key
    String location = conf.get("job.stopwords.path");
    if (location != null) {
        BufferedReader br = null;
        try {
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path(location);
            if (fs.exists(path)) {
                stopwords = new HashMap<String, Object>();
                FSDataInputStream fis = fs.open(path);
                br = new BufferedReader(new InputStreamReader(fis));
                String line;
                while ((line = br.readLine()) != null) {
                    line = line.trim();
                    // skip blank lines instead of stopping at the first one
                    if (line.length() > 0) {
                        stopwords.put(line, null);
                    }
                }
            }
        }
        catch (IOException e) {
            // handle/log the error as appropriate
        }
        finally {
            IOUtils.closeQuietly(br);
        }
    }
}
...

Then you can use stopwords in your map method.
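For instance, in a word-count-style mapper (a sketch; the key/value types are assumptions):

@Override
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
        // emit only tokens that are not in the stopword map
        if (stopwords == null || !stopwords.containsKey(token)) {
            context.write(new Text(token), new IntWritable(1));
        }
    }
}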

Another option is to create the map object with the stopwords in the job runner class, serialize it to a Base64-encoded string, pass it to the mappers as the value of some key in the Configuration object, and deserialize it in the setup method.
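In outline it could look like this (a sketch using commons-codec's Base64, which ships with Hadoop; the configuration key name is made up):

import java.io.*;
import java.util.Map;
import org.apache.commons.codec.binary.Base64;

// Job runner: serialize the map and stash it in the Configuration.
ByteArrayOutputStream bos = new ByteArrayOutputStream();
ObjectOutputStream oos = new ObjectOutputStream(bos);
oos.writeObject(stopwords); // the map must be Serializable, e.g. a HashMap
oos.close();
conf.set("job.stopwords.ser", new String(Base64.encodeBase64(bos.toByteArray()), "UTF-8"));

// Mapper.setup: decode and deserialize it back.
byte[] bytes = Base64.decodeBase64(
        context.getConfiguration().get("job.stopwords.ser").getBytes("UTF-8"));
ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes));
@SuppressWarnings("unchecked")
Map<String, Object> stopwords = (Map<String, Object>) ois.readObject(); // also throws ClassNotFoundException
ois.close();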

I'd choose the first option, not just because it's easier, but also because it's not a good idea to pass larger amounts of data via the Configuration object.

Upvotes: 2
