Capacytron

Reputation: 3739

Pass file with parameters to mapreduce job

I have a MapReduce Mapper. This Mapper should use a set of read-only parameters. Let's imagine that I want to count occurrences of some substrings (titles of something) in input lines. I have a list of pairs: "some title" => "a regular expression to extract this title from an input line". These pairs are stored in a plain text file.
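For illustration, each line of that file might hold one pair, e.g. a title and its regular expression separated by a tab (the exact format is not important):

Breaking News	<title>Breaking News</title>
Weather Report	<title>Weather Report</title>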

What is the best way to pass this file to the Mapper? So far I have only this idea:

  1. Upload the file with the pairs to HDFS.
  2. Pass the path to the file using -Dpath.to.file.with.properties.
  3. In the static {} section of the Mapper, read the file and populate the map "some title" => "regular expression for the title" (rough sketch below).
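Roughly what I have in mind, as a sketch with placeholder names (I read the file in setup() rather than a static block so the job Configuration is available; note that -D properties only reach the Configuration if the driver goes through ToolRunner/GenericOptionsParser):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TitleCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Map<String, Pattern> titlePatterns = new HashMap<String, Pattern>();

    @Override
    protected void setup(Context context) throws IOException {
        // Path passed on the command line as -Dpath.to.file.with.properties=...
        Path file = new Path(context.getConfiguration().get("path.to.file.with.properties"));
        FileSystem fs = file.getFileSystem(context.getConfiguration());
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2); // assumed format: title<TAB>regex
                if (parts.length == 2) {
                    titlePatterns.put(parts[0], Pattern.compile(parts[1]));
                }
            }
        } finally {
            reader.close();
        }
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit ("some title", 1) for every title whose regular expression matches the line.
        for (Map.Entry<String, Pattern> e : titlePatterns.entrySet()) {
            if (e.getValue().matcher(value.toString()).find()) {
                context.write(new Text(e.getKey()), ONE);
            }
        }
    }
}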

Is this a good or a bad approach? Please advise.

Upvotes: 1

Views: 3613

Answers (2)

Capacytron

Reputation: 3739

Here is part of my code. See the script that copies the files to HDFS and launches the MR job. I upload this script to the Hadoop node during the Maven integration-test phase using Ant's scp and ssh targets.

# Dummy script for running the MR job
hadoop fs -rm -r /HttpSample/output
hadoop fs -rm -r /HttpSample/metadata.csv
hadoop fs -rm -r /var/log/hadoop-yarn/apps/cloudera/logs
hadoop fs -copyFromLocal /home/cloudera/uploaded_jars/metadata.csv /HttpSample/metadata.csv
hadoop fs -copyFromLocal /home/cloudera/uploaded_jars/opencsv.jar /HttpSample/opencsv.jar
hadoop fs -copyFromLocal /home/cloudera/uploaded_jars/gson.jar /HttpSample/gson.jar
# Run the MR job
cd /home/cloudera/uploaded_jars
hadoop jar scoring-job.jar ru.megalabs.mapreduce.scoringcounter.Main -files hdfs://0.0.0.0:8020/HttpSample/metadata.csv -libjars hdfs://0.0.0.0:8020/HttpSample/opencsv.jar,hdfs://0.0.0.0:8020/HttpSample/gson.jar /HttpSample/raw_traffic.json /HttpSample/output/scoring_result

And the code inside the Mapper:

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ScoringCounterMapper extends Mapper<LongWritable, Text, GetReq, IntWritable> {

    private static final Log LOG = LogFactory.getLog(ScoringCounterMapper.class);

    // Base name of the file shipped with -files; it appears in the task's working directory.
    private static final String METADATA_CSV = "metadata.csv";

    private final static IntWritable one = new IntWritable(1);

    private List<RegexMetadata> regexMetadatas = null;

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // bla-bla-bla (mapping logic omitted)
    }

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // -files makes metadata.csv available locally, so a plain File by its base name works here.
        MetadataCsvReader metadataCsvReader = new MetadataCsvReader(new File(METADATA_CSV));
        regexMetadatas = metadataCsvReader.getMetadata();
        for (RegexMetadata rm : regexMetadatas) {
            LOG.info(rm);
        }
    }
}

Note that:

  1. I upload my metadata file to the node.
  2. I put it into HDFS.
  3. I provide the path to the file using the -files argument (see the driver sketch below).
  4. I specify that the file is inside HDFS (hdfs://0.0.0.0:8020/HttpSample/metadata.csv).
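The -files and -libjars options are generic options, so they only take effect because Main is run through GenericOptionsParser (which ToolRunner does for you). A simplified sketch of such a driver, not my exact code (the job name and class wiring are just illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Main extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // At this point GenericOptionsParser has already consumed -files and -libjars,
        // so args[0] is the input path and args[1] is the output path.
        Job job = Job.getInstance(getConf(), "scoring-job");
        job.setJarByClass(Main.class);
        job.setMapperClass(ScoringCounterMapper.class);
        job.setOutputKeyClass(GetReq.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner runs GenericOptionsParser, which handles -files, -libjars and -D options.
        System.exit(ToolRunner.run(new Configuration(), new Main(), args));
    }
}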

Upvotes: 1

Joe K

Reputation: 18434

You're on the right track, but I would recommend using the distributed cache. Its purpose is exactly this: passing read-only files to task nodes.

  1. Put the file in HDFS.
  2. Add that file to the distributed cache in the main method of your application.
  3. In the Mapper class, override either the configure or setup method, depending on which version of the API you are using. In that method, read from the distributed cache and store everything in memory (see the sketch after this list).
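Something along these lines, using the newer mapreduce API (a minimal sketch; the class names and the tab-separated file format are assumptions, only the /HttpSample/metadata.csv path is taken from your setup):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TitleCountJob {

    public static class TitleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final Map<String, String> titleRegexes = new HashMap<String, String>();

        @Override
        protected void setup(Context context) throws IOException {
            // The '#metadata.csv' fragment used in main() makes the cached HDFS file
            // show up as a local symlink named metadata.csv in the task's working directory.
            BufferedReader reader = new BufferedReader(new FileReader("metadata.csv"));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t", 2); // assumed format: title<TAB>regex
                    if (parts.length == 2) {
                        titleRegexes.put(parts[0], parts[1]);
                    }
                }
            } finally {
                reader.close();
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (Map.Entry<String, String> e : titleRegexes.entrySet()) {
                if (value.toString().matches(e.getValue())) {
                    context.write(new Text(e.getKey()), new IntWritable(1));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "title-count");
        job.setJarByClass(TitleCountJob.class);
        job.setMapperClass(TitleMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Steps 1-2: the file already sits in HDFS; register it with the distributed cache.
        job.addCacheFile(new URI("/HttpSample/metadata.csv#metadata.csv"));
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}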

Upvotes: 4
