Reputation: 3739
I have a MapReduce Mapper. This Mapper should use a set of read-only parameters. Let's imagine that I want to count occurrences of some substrings (titles of something) in input lines. I have a list of pairs: "some title" => "a regular expression to extract this title from an input line". These pairs are stored in a plain text file.
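For illustration only, such a file could contain lines like the following (the exact layout is made up, just to show the title/regex pairing):
"Some Title","GET /titles/some-title.*"
"Another Title","title=another-title"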
What is the best way to pass this file to the Mapper? I have only this idea:
Is it good or bad? Please advise.
Upvotes: 1
Views: 3613
Reputation: 3739
Here is part of my code. See the script that copies files to HDFS and launches the MR job. I upload this script to the Hadoop node during the Maven integration-test phase using Ant scp and ssh targets.
#dummy script for running mr-job
hadoop fs -rm -r /HttpSample/output
hadoop fs -rm -r /HttpSample/metadata.csv
hadoop fs -rm -r /var/log/hadoop-yarn/apps/cloudera/logs
#hadoop hadoop dfs -put /home/cloudera/uploaded_jars/metadata.csv /HttpSample/metadata.csv
hadoop fs -copyFromLocal /home/cloudera/uploaded_jars/metadata.csv /HttpSample/metadata.csv
hadoop fs -copyFromLocal /home/cloudera/uploaded_jars/opencsv.jar /HttpSample/opencsv.jar
hadoop fs -copyFromLocal /home/cloudera/uploaded_jars/gson.jar /HttpSample/gson.jar
#Run mr job
cd /home/cloudera/uploaded_jars
#hadoop jar scoring-job.jar ru.megalabs.mapreduce.scoringcounter.Main -libjars gson.jar -files hdfs://0.0.0.0:8020/HttpSample/metadata.csv -libjars hdfs://0.0.0.0:8020/HttpSample/opencsv.jar, hdfs://0.0.0.0:8020/HttpSample/gson.jar /HttpSample/raw_traffic.json /HttpSample/output/scoring_result
hadoop jar scoring-job.jar ru.megalabs.mapreduce.scoringcounter.Main -files hdfs://0.0.0.0:8020/HttpSample/metadata.csv -libjars hdfs://0.0.0.0:8020/HttpSample/opencsv.jar,hdfs://0.0.0.0:8020/HttpSample/gson.jar /HttpSample/raw_traffic.json /HttpSample/output/scoring_result
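Note that the -files and -libjars options are generic options: they are only honored when the job is launched through ToolRunner / GenericOptionsParser. My Main class is not shown here, but it has to look roughly like the following sketch (the exact job wiring is an assumption, only the class names come from the code above):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Sketch of a driver: ToolRunner lets GenericOptionsParser consume -files and -libjars
// before run() receives the remaining arguments (input and output paths).
public class Main extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "scoring-job");
        job.setJarByClass(Main.class);
        job.setMapperClass(ScoringCounterMapper.class);
        job.setMapOutputKeyClass(GetReq.class);
        job.setMapOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // /HttpSample/raw_traffic.json
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // /HttpSample/output/scoring_result
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new Main(), args));
    }
}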
And the code inside the Mapper:
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ScoringCounterMapper extends Mapper<LongWritable, Text, GetReq, IntWritable> {

    private static final Log LOG = LogFactory.getLog(ScoringCounterMapper.class);
    private static final String METADATA_CSV = "metadata.csv";

    private List<RegexMetadata> regexMetadatas = null;
    private final static IntWritable one = new IntWritable(1);

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // bla-bla-bla
    }

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // metadata.csv was passed with -files, so it is expected in the task's working directory
        MetadataCsvReader metadataCsvReader = new MetadataCsvReader(new File(METADATA_CSV));
        regexMetadatas = metadataCsvReader.getMetadata();
        for (RegexMetadata rm : regexMetadatas) {
            LOG.info(rm);
        }
    }
}
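The map() body is omitted above; a sketch of what it could look like, assuming RegexMetadata exposes getPattern() and getTitle() and GetReq has a constructor taking the title (those accessors and that constructor are assumptions, they are not shown in the question):
// Sketch only: getPattern()/getTitle() and the GetReq(String) constructor are assumed.
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    for (RegexMetadata rm : regexMetadatas) {
        java.util.regex.Matcher matcher = rm.getPattern().matcher(line); // assumed accessor returning a compiled Pattern
        if (matcher.find()) {
            context.write(new GetReq(rm.getTitle()), one); // emit (title, 1) for each matching line
        }
    }
}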
Note that:
1. I upload my metadata file to the node.
2. I put it into HDFS.
3. I provide the path to the file using the -files argument.
4. I specify that the file is inside HDFS (hdfs://0.0.0.0:8020/HttpSample/metadata.csv).
Upvotes: 1
Reputation: 18434
You're on the right track, but I would recommend using the distributed cache. Its purpose is exactly this: passing read-only files to task nodes.
The mapper can then read the file in its configure or setup method, depending on which version of the API you are using. In that method it can read from the distributed cache and store everything in memory.
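For the newer mapreduce API, a minimal sketch of the wiring, reusing the file path and the field/class names from the question (it needs java.net.URI and java.io.File; the rest of the job setup is assumed):
// Driver side: register the file in the distributed cache.
// The "#metadata.csv" fragment makes it available under that link name in the task's working directory.
job.addCacheFile(new URI("hdfs://0.0.0.0:8020/HttpSample/metadata.csv#metadata.csv"));

// Mapper side: read the cached file once per task in setup().
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    URI[] cacheFiles = context.getCacheFiles(); // lists everything that was cached, if you need to inspect it
    MetadataCsvReader metadataCsvReader = new MetadataCsvReader(new File("metadata.csv"));
    regexMetadatas = metadataCsvReader.getMetadata();
}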
Upvotes: 4