Hadoop InputFormat set Key to Input File Path

Question

My Hadoop job needs to know the input path that each record came from.

For example, assume I am running a job over a collection of S3 objects:

s3://bucket/file1
s3://bucket/file2
s3://bucket/file3

I would like to reduce over key-value pairs in which the key is the source path and the value is the record, such as

s3://bucket/file1    record1
s3://bucket/file1    record2
s3://bucket/file2    record1
...
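
To make the target shape concrete, here is a plain-Java sketch of the pairing I am after. It uses no Hadoop or Crunch APIs, and the class and method names (PathKeyedRecords, keyByPath) are made up for illustration; it only shows the output I want, not how an InputFormat would produce it.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PathKeyedRecords {

    /**
     * Pairs every record with the path of the file it was read from.
     * Hypothetical helper: stands in for what the input format should emit.
     */
    public static List<Map.Entry<String, String>> keyByPath(
            Map<String, List<String>> recordsByPath) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (Map.Entry<String, List<String>> file : recordsByPath.entrySet()) {
            for (String record : file.getValue()) {
                // Key = source path, value = the record itself.
                pairs.add(new AbstractMap.SimpleImmutableEntry<>(file.getKey(), record));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the files in insertion order for readable output.
        Map<String, List<String>> input = new LinkedHashMap<>();
        input.put("s3://bucket/file1", Arrays.asList("record1", "record2"));
        input.put("s3://bucket/file2", Arrays.asList("record1"));

        for (Map.Entry<String, String> pair : keyByPath(input)) {
            System.out.println(pair.getKey() + "\t" + pair.getValue());
        }
    }
}
```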

Is there an extension of org.apache.hadoop.mapreduce.InputFormat that would accomplish this? Or is there a better way to go about this than using a custom input format?

I know that in a mapper this information is accessible from the MapContext (see "How to get the input file name in the mapper in a Hadoop program?"). However, I am using Apache Crunch and cannot control whether any given step runs as a map or a reduce. I can reliably control the InputFormat, so that seemed like the right place to do this.

Answers (1)