Reputation:
Suppose I have one input file that HDFS stores as three blocks. Assume I have three data nodes and each data node stores one block. With 3 input splits, 3 mappers run in parallel, each processing the data local to its data node. Each mapper gets its input as key-value pairs via the InputFormat and RecordReader. In this scenario I am using TextInputFormat, where each record is a complete line of text from the file.
The question is what happens if a record is broken across the end of the first block.
1) How does Hadoop read the complete record in this scenario?
2) Does data node 1 contact data node 2 to get the complete record?
3) What happens if data node 2 starts processing its data and finds an incomplete record in its first line?
Upvotes: 4
Views: 2754
Reputation: 91
From the Hadoop source code of LineRecordReader.java, in the constructor, I found these comments:
// If this is not the first split, we always throw away first record
// because we always (except the last split) read one extra line in
// next() method.
if (start != 0) {
    start += in.readLine(new Text(), 0, maxBytesToConsume(start));
}
this.pos = start;
From this I believe (not confirmed) that Hadoop reads one extra line for each split (at the end of the current split, it reads the next line, which extends into the next split), and for every split except the first, the first line is thrown away. That way no line record is lost or left incomplete.
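To make that reasoning concrete, here is a minimal, self-contained sketch of the same idea in plain Java. It is not Hadoop's code: the class `SplitReaderSketch` and the helpers `readSplit` and `nextLineStart` are hypothetical names invented for illustration, the "file" is just an in-memory byte array, and only `\n` line endings are handled. It assumes the rule described above: a reader that does not start at offset 0 skips its first line, and every reader keeps reading lines as long as the line starts at or before the end of its split.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is NOT Hadoop's LineRecordReader.
class SplitReaderSketch {

    /** Returns the lines that the split [start, end) is responsible for. */
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // Not the first split: throw away the first (possibly partial) line,
        // because the reader of the previous split has already consumed it.
        if (start != 0) {
            pos = nextLineStart(data, pos);
        }
        // Read every line that starts at or before 'end'. The last such line
        // may run past 'end', i.e. into the bytes of the next split.
        while (pos <= end && pos < data.length) {
            int next = nextLineStart(data, pos);
            // Exclude the trailing '\n' from the record, if there is one.
            int lineEnd = (data[next - 1] == '\n') ? next - 1 : next;
            records.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return records;
    }

    /** Offset just past the next '\n' at or after 'from' (or end of data). */
    static int nextLineStart(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        // Split boundaries deliberately fall in the middle of lines.
        int[][] splits = { {0, 9}, {9, 18}, {18, data.length} };
        for (int[] s : splits) {
            System.out.println("split [" + s[0] + ", " + s[1] + ") -> "
                    + readSplit(data, s[0], s[1]));
        }
    }
}
```

With the boundaries at bytes 9 and 18, the first split yields `alpha` and `bravo` (finishing `bravo` past its own end), the second skips the tail of `bravo` and yields `charlie`, and the third skips the tail of `charlie` and yields `delta`. Each line is produced exactly once, which matches the behaviour the source comment describes. In real HDFS the bytes past the end of a split may live in a block on another data node, so reading them can be a remote read done through the file system client; the sketch does not model that.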
Upvotes: 0
Reputation: 30089
Hope that helps
Upvotes: 4
Reputation: 1799
If you have "Hadoop: The Definitive Guide", take a look at page 246 (in the latest edition) which discusses this exact issue (although rather briefly, unfortunately).
Upvotes: 1