qwwqwwq

Reputation: 7329

Hadoop InputFormat set Key to Input File Path

My hadoop job needs to be aware of the input path that each record is derived from.

For example, assume I am running a job over a collection of S3 objects:

s3://bucket/file1
s3://bucket/file2
s3://bucket/file3

I would like to reduce over key-value pairs such as:

s3://bucket/file1    record1
s3://bucket/file1    record2
s3://bucket/file2    record1
...

Is there an extension of org.apache.hadoop.mapreduce.InputFormat that would accomplish this? Or is there a better way to go about this than using a custom input format?

I know that in a mapper this information is accessible from the MapContext (How to get the input file name in the mapper in a Hadoop program?), but I am using Apache Crunch and cannot control whether any of my steps will be maps or reduces. I can reliably control the InputFormat, though, so it seemed to me to be the place to do this.
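
For reference, a rough sketch of that MapContext approach in a plain Hadoop mapper (my own illustration; the class name and the Text key/value types are just assumptions):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Illustration only: emit the input file path as the output key.
public class PathAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // A FileInputFormat-based task is backed by a FileSplit, which carries
        // the path (e.g. s3://bucket/file1) of the file being read.
        String path = ((FileSplit) context.getInputSplit()).getPath().toString();
        context.write(new Text(path), value);
    }
}

This only works when I own the mapper, which is exactly what Crunch does not let me rely on, hence the InputFormat question.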

Upvotes: 2

Views: 549

Answers (1)

Kamal

Reputation: 61

Please have a look at my blog article on customizing the InputSplit and RecordReader.

The code in that blog sets the key as below (lines 69-70 of the RecordReader code):

value = new Text(line);
key = new LongWritable(splitstart);

In your case you would need to set the key as below; I haven't tested it, though.

// the key type has to change from LongWritable to Text to hold the path
key = new Text(fsplit.getPath().toString());
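
If it helps, here is a minimal self-contained sketch (not from my blog, and untested) of an InputFormat that wraps Hadoop's LineRecordReader and uses the file path as a Text key; the class names are just placeholders:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Sketch: reuse LineRecordReader for the values and replace the usual
// byte-offset key with the path of the file the split belongs to.
public class PathKeyInputFormat extends FileInputFormat<Text, Text> {

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new PathKeyRecordReader();
    }

    public static class PathKeyRecordReader extends RecordReader<Text, Text> {
        private final LineRecordReader delegate = new LineRecordReader();
        private Text key;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            delegate.initialize(split, context);
            // Every record from this split shares the same key: the file path.
            key = new Text(((FileSplit) split).getPath().toString());
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            return delegate.nextKeyValue();
        }

        @Override
        public Text getCurrentKey() {
            return key;
        }

        @Override
        public Text getCurrentValue() {
            return delegate.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return delegate.getProgress();
        }

        @Override
        public void close() throws IOException {
            delegate.close();
        }
    }
}

Each record then comes out as (file path, line). I can't say how Crunch's type mapping will handle the Text/Text pair, so treat this as a starting point rather than a drop-in.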

Upvotes: 1
