Johan

Reputation: 3819

Default Record Reader in Hadoop, global or local byte offset

We know that a mapper in Hadoop (as well as a reducer) can only handle key-value pairs as input and output. A RecordReader is the component that transforms the raw input from a file into key-value pairs. You may write your own `RecordReader`.

The default input format provided by Hadoop is `TextInputFormat`, whose record reader (`LineRecordReader`) reads lines of text files. The key it emits for each record of the split is the byte offset of the line read (as a `LongWritable`), and the value is the contents of the line up to the terminating `\n` character (as a `Text` object).
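
For example, a mapper consuming this input declares `LongWritable` and `Text` as its input key and value types. The following minimal sketch (the class name and output types are just illustrative) simply re-emits the pairs it receives, which also makes it easy to see which offsets actually reach the mapper:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Minimal sketch: the input key is the byte offset supplied by the record
// reader, the input value is the line itself. Re-emitting both makes it
// easy to inspect the offsets the mapper is given.
public class OffsetEchoMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    context.write(offset, line);
  }
}
```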

We also know that the framework instantiates one mapper for each input split.

Suppose there is a huge file F stored on HDFS, with its splits spread over several different nodes; file F is line-separated and is processed by some job with the default record reader. My question is: is the byte offset of each line (used as the key for that line) computed locally with respect to the split, or globally with respect to the overall file?

Simply put, suppose that I have a file consisting of two splits with 4 lines each. For the sake of simplicity, let each line be exactly 1 byte, so that the byte offsets are 0, 1, 2, 3 for the first four lines:

0 - Line 1
1 - Line 2
2 - Line 3
3 - Line 4

So in the mapper where this split is processed, Line i is supplied with key i-1 by the default record reader. The second split is possibly on another node:

? - Line 5
? - Line 6
? - Line 7
? - Line 8

and the question is whether the byte offsets will be 4, 5, 6, 7, or will start again from scratch at 0, 1, 2, 3.

Upvotes: 2

Views: 437

Answers (1)

Thomas Jungblut

Reputation: 20969

It is the "global" offset.

You can see it in the code, where the position is initialized from the file split's offset. In the case of a very big file, that is the byte offset at which the split begins. The position is then incremented from there and passed along with each line to your mapper code.
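
For reference, here is a minimal sketch of that logic, paraphrasing what `LineRecordReader` (the record reader behind `TextInputFormat`) does; names are simplified, and compression, max-line-length limits and the handling of lines that straddle split boundaries are left out:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

// Simplified sketch, not verbatim Hadoop source.
public class SimpleLineRecordReader extends RecordReader<LongWritable, Text> {
  private long start;  // global byte offset of this split within the file
  private long end;    // global byte offset where this split ends
  private long pos;    // current position, counted from the beginning of the file
  private LineReader in;
  private final LongWritable key = new LongWritable();
  private final Text value = new Text();

  @Override
  public void initialize(InputSplit genericSplit, TaskAttemptContext context)
      throws IOException {
    FileSplit split = (FileSplit) genericSplit;
    start = split.getStart();            // offset of the split in the whole file
    end = start + split.getLength();
    Path file = split.getPath();
    FileSystem fs = file.getFileSystem(context.getConfiguration());
    FSDataInputStream fileIn = fs.open(file);
    fileIn.seek(start);                  // jump to where this split begins
    in = new LineReader(fileIn, context.getConfiguration());
    pos = start;                         // position starts at the split's global offset
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    if (pos >= end) {
      return false;
    }
    key.set(pos);                        // key = global byte offset of this line
    int consumed = in.readLine(value);   // reads one line into 'value'
    pos += consumed;                     // advance by the bytes just consumed
    return consumed != 0;
  }

  @Override
  public LongWritable getCurrentKey() { return key; }

  @Override
  public Text getCurrentValue() { return value; }

  @Override
  public float getProgress() {
    return end == start ? 0.0f : Math.min(1.0f, (pos - start) / (float) (end - start));
  }

  @Override
  public void close() throws IOException {
    if (in != null) {
      in.close();
    }
  }
}
```

In your example, the mapper for the second split would therefore see keys 4, 5, 6, 7, because `pos` starts at the split's global offset rather than at zero.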

Upvotes: 1
