Reputation: 447
I have a file of 100 MB and, say, the default block size is 64 MB. If I do not set the input split size, the default split size will be the block size, so the split size is also 64 MB.
When I load this 100 MB file into HDFS, it will be split into 2 blocks: 64 MB and 36 MB. For example, below are poem lyrics of size 100 MB. If I load this data into HDFS, say line 1 up to the middle of line 16 makes exactly 64 MB as one split/block (up to "It made the"), and the remaining half of line 16 ("children laugh and play") through the end of the file makes the second block (36 MB). There are going to be two mapper jobs.
My question is how the 1st mapper will handle line 16 (that is, line 16 of block 1), given that the block contains only half of that line, and how the second mapper will handle the 1st line of block 2, since it also contains only half a line.
Mary had a little lamb
Little lamb, little lamb
Mary had a little lamb
Its fleece was white as snow
And everywhere that Mary went
Mary went, Mary went
Everywhere that Mary went
The lamb was sure to go
He followed her to school one day
School one day, school one day
He followed her to school one day
Which was against the rule
It made the children laugh and play
Laugh and play, laugh and play
It made the children laugh and play
To see a lamb at school
And so the teacher turned him out
Turned him out, turned him out
And so the teacher turned him out
But still he lingered near
And waited patiently
Patiently, patiently
And wai-aited patiently
Til Mary did appear
Or, while cutting off the first 64 MB, instead of splitting the line in the middle, will Hadoop consider the whole of line 16?
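To make the arithmetic concrete, here is a small standalone sketch of how I understand the split boundaries to be worked out. As far as I understand, the formula mirrors what FileInputFormat uses (splitSize = max(minSize, min(maxSize, blockSize))); the class and output below are only for illustration.

// Illustrative only: walks a 100 MB file with a 64 MB block size and
// prints the (start, length) of each split the way I picture
// FileInputFormat doing it.
public class SplitSizeSketch {
    public static void main(String[] args) {
        long fileSize  = 100L * 1024 * 1024;  // the 100 MB file
        long blockSize = 64L * 1024 * 1024;   // default block size
        long minSize   = 1L;                  // mapreduce.input.fileinputformat.split.minsize
        long maxSize   = Long.MAX_VALUE;      // mapreduce.input.fileinputformat.split.maxsize

        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));

        for (long start = 0; start < fileSize; start += splitSize) {
            long length = Math.min(splitSize, fileSize - start);
            System.out.printf("split: start=%d, length=%d (%d MB)%n",
                    start, length, length / (1024 * 1024));
        }
        // Prints two splits, 64 MB and 36 MB. The 64 MB cut is a byte
        // position, so it can fall in the middle of a line of the poem.
    }
}

So the split boundary is purely byte-based, which is exactly why I am asking how the line that straddles it is handled.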
Upvotes: 1
Views: 542
Reputation: 20820
In Hadoop, data is read based on the input split size and the block size.
The file is divided into multiple FileSplits based on that size. Each input split is initialized with a start parameter corresponding to its offset in the input.
When we initialize the LineRecordReader, it tries to instantiate a LineReader, which starts reading the lines.
If a CompressionCodec is defined, it takes care of the boundaries. Otherwise, if the start of the InputSplit is not 0, the reader backtracks 1 character and then skips the first line it encounters (terminated by \n or \r\n). The backtrack ensures that it doesn't skip a valid line.
Here is the code:
if (codec != null) {
  // Compressed input: one reader handles the whole stream,
  // so the split effectively runs to the end of the file.
  in = new LineReader(codec.createInputStream(fileIn), job);
  end = Long.MAX_VALUE;
} else {
  if (start != 0) {
    // Not the first split: back up one byte and skip the (possibly
    // partial) first line; the previous split's reader consumes it.
    skipFirstLine = true;
    --start;
    fileIn.seek(start);
  }
  in = new LineReader(fileIn, job);
}
if (skipFirstLine) {  // skip first line and re-establish "start".
  start += in.readLine(new Text(), 0,
                       (int) Math.min((long) Integer.MAX_VALUE, end - start));
}
this.pos = start;
Since the splits are calculated in the client, the mappers don't need to run in sequence; every mapper already knows whether it needs to discard the first line or not.
So, in your case, the mapper for the first block B1 will read the data from offset 0 through the end of the line "It made the children laugh and play", even though part of that line physically sits in the second block.
The mapper for block B2 will skip the partial first line ("children laugh and play") and start with the next complete line, "Laugh and play, laugh and play", reading through to the last line's offset.
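To see this end to end, here is a small self-contained simulation of that logic (plain Java, not the actual Hadoop classes; the three lines of data and the split position are made up for illustration). It shows the first "mapper" reading past its split end to finish the line, and the second one skipping the partial first line:

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Stand-alone simulation of the LineRecordReader behaviour quoted above:
// each "mapper" skips a partial first line (unless its split starts at 0)
// and keeps reading past its split end until it finishes the current line.
public class SplitBoundarySimulation {
    static final byte[] DATA =
            ("Which was against the rule\n"
           + "It made the children laugh and play\n"
           + "Laugh and play, laugh and play\n").getBytes(StandardCharsets.UTF_8);

    public static void main(String[] args) throws IOException {
        int splitEnd = 40;                 // pretend the 64 MB boundary falls here,
                                           // in the middle of "It made the ..."
        readSplit(0, splitEnd);            // mapper for block B1
        readSplit(splitEnd, DATA.length);  // mapper for block B2
    }

    static void readSplit(int start, int end) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(DATA, start, DATA.length - start),
                StandardCharsets.UTF_8));
        int pos = start;
        if (start != 0) {                  // skip the partial first line;
            String skipped = in.readLine();// the previous mapper already read it
            pos += skipped.length() + 1;
        }
        System.out.println("--- split [" + start + ", " + end + ") ---");
        String line;
        while (pos < end && (line = in.readLine()) != null) {
            System.out.println(line);      // may run past 'end' to finish a line
            pos += line.length() + 1;
        }
        in.close();
    }
}

The real LineRecordReader additionally backtracks one byte (the --start in the snippet above) before skipping, so a split that starts exactly on a line boundary does not throw away a complete line; the simulation leaves that detail out.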
You can refer to these for more details:
https://hadoopabcd.wordpress.com/2015/03/10/hdfs-file-block-and-input-split/
How does Hadoop process records split across block boundaries?
Upvotes: 1
Reputation: 12502
The 1st mapper will read the whole 16th line (it'll keep reading until it finds an end-of-line character).
If you remember, in order to apply MapReduce your input has to be organized into key-value pairs. For TextInputFormat, which happens to be the default in Hadoop, those pairs are (offset_from_file_beginning, line_of_text). The text is broken into those key-value pairs based on the '\n' character. So, if a line of text goes beyond the size of an input split, the mapper will keep reading until it finds a '\n'.
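As a rough illustration (plain Java, not TextInputFormat itself; the offsets are just computed by hand), the pairs for the first few lines of the poem would look like this:

// Rough illustration of the (byte offset, line of text) pairs that
// TextInputFormat hands to the mapper. Plain Java, not the Hadoop API.
public class KeyValueSketch {
    public static void main(String[] args) {
        String[] lines = {
            "Mary had a little lamb",
            "Little lamb, little lamb",
            "Mary had a little lamb",
            "Its fleece was white as snow"
        };
        long offset = 0;                 // key = byte offset of the line start
        for (String line : lines) {
            System.out.println("(" + offset + ", \"" + line + "\")");
            offset += line.length() + 1; // +1 for the '\n' that ends the line
        }
        // Because the key is a byte offset, a line that straddles a split
        // boundary still belongs to exactly one pair, read by one mapper.
    }
}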
Upvotes: 0