Reputation: 7329
My Hadoop job needs to know which input path each record came from.
For example, assume I am running a job over a collection of S3 objects:
s3://bucket/file1
s3://bucket/file2
s3://bucket/file3
I would like to reduce key-value pairs such as:
s3://bucket/file1 record1
s3://bucket/file1 record2
s3://bucket/file2 record1
...
Is there an extension of org.apache.hadoop.mapreduce.InputFormat that would accomplish this? Or is there a better way to go about this than using a custom input format?
I know that in a mapper this information is accessible from the MapContext (see "How to get the input file name in the mapper in a Hadoop program?"). However, I am using Apache Crunch and cannot control whether any given step will run as a map or a reduce. I can reliably control the InputFormat, so that seemed to me to be the place to do this.
Upvotes: 2
Views: 549
Reputation: 61
Please have a look at my blog article on customizing the InputSplit and RecordReader.
The code in that article sets the key and value as below (lines 69-70 of the RecordReader code):
value = new Text(line);
key = new LongWritable(splitstart);
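In your case you need to set the key to the split's path instead. Because the path is a string, the key type has to change from LongWritable to Text. I didn't test it though.
key = new Text(fsplit.getPath().toString());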
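To show how that change might fit into a complete input format, here is a minimal, untested sketch. The class names PathKeyedTextInputFormat and PathKeyedRecordReader are made up for illustration and are not from the blog article. It assumes line-oriented text input and that the split handed to the reader is a plain FileSplit; it delegates the actual reading to Hadoop's stock LineRecordReader and only replaces the key.

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical input format: emits (input path, line) instead of (byte offset, line).
public class PathKeyedTextInputFormat extends FileInputFormat<Text, Text> {

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new PathKeyedRecordReader();
    }

    // Hypothetical reader: wraps LineRecordReader and swaps the key for the file path.
    public static class PathKeyedRecordReader extends RecordReader<Text, Text> {
        private final LineRecordReader lines = new LineRecordReader();
        private Text path;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
            lines.initialize(split, context);
            // Assumption: the split is a FileSplit. Combined or wrapped splits
            // would need different handling and would fail this cast.
            path = new Text(((FileSplit) split).getPath().toString());
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            return lines.nextKeyValue();
        }

        @Override
        public Text getCurrentKey() {
            return path; // same path for every record in this split
        }

        @Override
        public Text getCurrentValue() {
            return lines.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException {
            return lines.getProgress();
        }

        @Override
        public void close() throws IOException {
            lines.close();
        }
    }
}
```

Delegating to LineRecordReader keeps the split-boundary and compression handling in Hadoop's own code, so the only custom logic is the key. Whether Crunch passes an unwrapped FileSplit to a custom input format is something to verify against your Crunch version.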
Upvotes: 1