FourOfAKind

Reputation: 2418

Is the input split size constant, or does it depend on the logical record?

Hadoop: The Definitive Guide says:

With a minimum split size of 1, a maximum split size of Long.MAX_VALUE, and a block size of 64MB, the split size is 64MB.
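
For reference, that result follows from the rule max(minimumSize, min(maximumSize, blockSize)). Below is a minimal sketch of the arithmetic; the class and method names are only illustrative and this is not a copy of the Hadoop source.

```java
// Hypothetical helper class that reproduces the split-size rule for illustration.
class SplitSizeSketch {
    // splitSize = max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // 64MB expressed in bytes
        // With the values from the question, the block size wins.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));
    }
}
```

With these values the split size equals the block size, 67108864 bytes. Note that this is a size in bytes and says nothing about where lines begin or end.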

TextInputFormat's logical records are lines. Since each line has a different length, how can a split be exactly 64MB?

Upvotes: 1

Views: 865

Answers (2)

Roger

Reputation: 2953

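The reader always follows these two rules:

  1. Determine whether you are in the middle of a record.
  2. If so, scan past that record and start reading at the next full record.

The record you skipped is not lost: its first half lies in the previous InputSplit, and it is read in full as the last record of that split.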

Upvotes: 1

Razvan

Reputation: 10093

HDFS blocks are sequences of bytes. They are not aware of lines or any other structure. So a split may consist of a single block (64MB in this case) that ends in the middle of a line, i.e. it does not contain the whole last line. When you read it with TextInputFormat, the reader also reads some bytes from the next block, so you still get the entire last line.
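
To make the boundary handling concrete, here is a minimal sketch that simulates it on an in-memory byte array instead of HDFS. The class `SplitLineReaderSketch` and the method `readSplit` are hypothetical names, only `\n` is treated as a line terminator, and the logic is a simplification of what a line-oriented record reader does, not the actual Hadoop implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of line reading across fixed-size byte splits.
class SplitLineReaderSketch {
    // Returns the lines owned by the split covering bytes [start, start + length).
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = Math.min(start + length, data.length);
        int pos = start;
        // A split that does not begin at byte 0 discards everything up to and
        // including the first newline; the previous split reads that line.
        if (start != 0) {
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // step past the newline
        }
        List<String> lines = new ArrayList<>();
        // Keep reading while a line starts at or before the split end, even if
        // the line itself runs into the bytes of the next split.
        while (pos <= end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step past the newline
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        int splitSize = 10; // tiny stand-in for the 64MB split size
        for (int start = 0; start < data.length; start += splitSize) {
            System.out.println("split@" + start + " -> " + readSplit(data, start, splitSize));
        }
    }
}
```

In this run the splits are exactly 10, 10 and 6 bytes long, yet the first one yields `alpha` and `bravo`, the second yields `charlie` and `delta`, and the third yields nothing. The split size stays fixed in bytes, while the records each reader returns are shifted to whole lines.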

Upvotes: 3
