user1261215

Record Reader and Record Boundaries

Suppose I have one input file, and three blocks are created for it in HDFS. Assume I have three data nodes, each storing one block. If I have 3 input splits, 3 mappers will run in parallel to process the data local to their respective data nodes. Each mapper gets its input as key-value pairs via the InputFormat and its RecordReader. Consider this scenario with TextInputFormat, where each record is a complete line of text from the file (a minimal job sketch follows the questions below).

The question is: what happens if a record breaks across the end of the first block?

1) How does Hadoop read the complete record in this scenario?

2) Does data node 1 contact data node 2 to get the complete record?

3) What happens if data node 2 starts processing the data and finds an incomplete record in its first line?
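
For reference, here is a minimal sketch of the kind of job I have in mind (class names are illustrative): a map-only job using TextInputFormat, so each record arrives as a (byte offset, line of text) pair.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LineEchoJob {

  // TextInputFormat's LineRecordReader hands each mapper (byte offset, line)
  // pairs, regardless of how the file is cut into blocks and splits.
  public static class LineMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(offset, line); // key = byte offset of the line in the file
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "line-echo");
    job.setJarByClass(LineEchoJob.class);
    job.setMapperClass(LineMapper.class);
    job.setNumReduceTasks(0);                        // map-only job
    job.setInputFormatClass(TextInputFormat.class);  // one record per line
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}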

Upvotes: 4

Views: 2754

Answers (3)

Shenghai.Geng

Reputation: 91

In the Hadoop source code, the constructor of LineRecordReader.java contains these comments:

// If this is not the first split, we always throw away first record
// because we always (except the last split) read one extra line in
// next() method.
if (start != 0) {
  start += in.readLine(new Text(), 0, maxBytesToConsume(start));
}
this.pos = start;

From this I believe (not confirmed) that Hadoop reads one extra line for each split (at the end of the current split, it reads on into the next line, which belongs to the next split), and that for every split except the first, the first line is thrown away. That way no line record is lost or left incomplete.
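
To illustrate the idea, here is a toy model of that logic (my own sketch, not the real Hadoop class) applied to an in-memory byte array: every split except the first throws away its first, possibly partial, line, and every split reads lines for as long as they start at or before its end offset.

import java.util.ArrayList;
import java.util.List;

// Toy model of LineRecordReader's split handling, operating on an
// in-memory byte[] instead of an HDFS stream. Not the real Hadoop class.
public class SplitReadDemo {

  // The lines that the mapper for split [start, end) would process.
  static List<String> readSplit(byte[] file, int start, int end) {
    List<String> lines = new ArrayList<>();
    int pos = start;
    if (start != 0) {
      // Not the first split: throw away the first (possibly partial) line;
      // the previous split's reader consumed it as its one extra line.
      pos += lineLengthAt(file, pos);
    }
    // Read while the line *starts* at or before `end`; the last line read
    // may run past the split boundary into the next block.
    while (pos <= end && pos < file.length) {
      int len = lineLengthAt(file, pos);
      lines.add(new String(file, pos, len).trim());
      pos += len;
    }
    return lines;
  }

  // Length of the line starting at pos, including its trailing '\n'.
  static int lineLengthAt(byte[] file, int pos) {
    int i = pos;
    while (i < file.length && file[i] != '\n') {
      i++;
    }
    return Math.min(i + 1, file.length) - pos;
  }

  public static void main(String[] args) {
    byte[] file = "line1\nline2\nline3\nline4\n".getBytes();
    int splitSize = 9; // tiny "block" so lines straddle split boundaries
    for (int start = 0; start < file.length; start += splitSize) {
      int end = Math.min(start + splitSize, file.length);
      System.out.println("split [" + start + "," + end + ") -> "
          + readSplit(file, start, end));
    }
  }
}

With four 6-byte lines and a 9-byte split size this prints [line1, line2] for the first split, [line3, line4] for the second, and nothing for the third: every line is processed exactly once, even though line2 straddles the first split boundary.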

Upvotes: 0

Chris White

Reputation: 30089

  1. Hadoop will continue to read past the end of the first block until an EOL character or EOF is reached.
  2. The data nodes do not communicate with each other outside of data replication (and then only when instructed by the name node). The HDFS client will read the data from node 1, then from node 2.
  3. Some examples to clarify (a small simulation of this split logic follows the list):
    • If you have a single-line record spanning a 300 MB file with a 128 MB block size, mappers 2 and 3 will start reading from their given split offsets (128 MB and 256 MB respectively). They will both skip forward looking for the next EOL character and start their records from that point. In this example, both of those mappers will actually process 0 records.
    • A 300 MB file with two lines, each 150 MB long, and a 128 MB block size: mapper 1 will process the first line, finding its EOL character in block 2. Mapper 2 will start from offset 128 MB (block 2) and scan forward to find the EOL character at offset 150 MB. It will then read on until it finds the EOF at the end of block 3, and process that second line as its data. Mapper 3 will start at offset 256 MB (block 3) and scan forward to the EOF without hitting an EOL character, and hence process 0 records.
    • A 300 MB file with 6 lines, each 50 MB long:
      • mapper 1 - offset 0 -> 128 MB, lines 1 (0->50), 2 (50->100), 3 (100->150)
      • mapper 2 - offset 128 MB -> 256 MB, lines 4 (150->200), 5 (200->250), 6 (250->300)
      • mapper 3 - offset 256 MB -> 300 MB, 0 lines
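
To make these numbers concrete, here is a back-of-the-envelope model (my own sketch, not Hadoop code) that assigns lines to splits by the rule described above: a split skips any line the previous split already read, and otherwise takes every line starting at or before its own end offset.

import java.util.ArrayList;
import java.util.List;

// Which lines does each split's mapper process? A mapper for split
// [start, end) takes the lines whose start offset s satisfies s <= end
// and (s > start, or start == 0). All sizes below are in MB.
public class SplitAssignment {

  static void show(String label, long[] lineLen, long splitSize, long fileSize) {
    long[] s = new long[lineLen.length]; // start offset of each line
    for (int i = 1; i < lineLen.length; i++) {
      s[i] = s[i - 1] + lineLen[i - 1];
    }
    System.out.println(label);
    int mapper = 1;
    for (long start = 0; start < fileSize; start += splitSize, mapper++) {
      long end = Math.min(start + splitSize, fileSize);
      List<Integer> mine = new ArrayList<>();
      for (int i = 0; i < s.length; i++) {
        if (s[i] <= end && (start == 0 || s[i] > start)) {
          mine.add(i + 1); // 1-based line numbers, as in the examples
        }
      }
      System.out.println("  mapper " + mapper + " [" + start + "MB, " + end
          + "MB) -> lines " + mine);
    }
  }

  public static void main(String[] args) {
    show("one 300MB line:", new long[] {300}, 128, 300);
    show("two 150MB lines:", new long[] {150, 150}, 128, 300);
    show("six 50MB lines:", new long[] {50, 50, 50, 50, 50, 50}, 128, 300);
  }
}

Running this reproduces all three examples, e.g. in the six-line case mapper 1 gets lines [1, 2, 3], mapper 2 gets [4, 5, 6] and mapper 3 gets none.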

Hope that helps

Upvotes: 4

tsiki

Reputation: 1799

  1. Hadoop will do a remote read to node 2 to get the rest of the record
  2. Yes
  3. From what I understand, node 2 will ignore the incomplete record

If you have "Hadoop: The Definitive Guide", take a look at page 246 (in the latest edition), which discusses this exact issue (though rather briefly, unfortunately).

Upvotes: 1
