How does Hadoop split files without losing data integrity?

Question

We all know that a large input file is divided into splits of equal size (64 MB by default). Let's say I have a .txt file that is 104 MB. In theory, this file is divided into 2 splits (one of 64 MB and another of 40 MB). Is it possible for the split to occur in the middle of a word? For example, with the word "Hadoop", "Ha" would be the end of the first split and "doop" would be the beginning of the second split. If this occurs, how can we solve the WordCount problem properly?

Chris Gerken · Accepted Answer

That logic is encapsulated in the InputFormat configured for the mapper. There are different subclasses of InputFormat and you choose the subclass specific to the kind of file you consume with the Mapper. For example, the TextInputFormat class breaks lines on line feeds. There may be a partial line at the beginning or end of a split, but the logic recognizes those situations and still returns the complete line to exactly one mapper.
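
To make the boundary handling concrete, here is a minimal Python simulation of the idea. It is not Hadoop's code; `read_split` and `make_splits` are hypothetical names, and the rule it assumes (a reader whose split does not start at byte 0 discards everything up to the first line feed, and every reader keeps reading past the end of its split to finish a line that starts at or before that end) is a simplified sketch of how a line-oriented record reader can behave.

```python
def make_splits(total: int, size: int) -> list[tuple[int, int]]:
    """Cut `total` bytes into (start, length) splits of at most `size` bytes."""
    return [(s, min(size, total - s)) for s in range(0, total, size)]


def read_split(data: bytes, start: int, length: int) -> list[bytes]:
    """Return the complete lines that belong to the split [start, start + length)."""
    end = start + length
    pos = start
    if start != 0:
        # Assumed rule: the first (possibly partial) line is owned by the
        # reader of the previous split, so skip past the next line feed.
        nl = data.find(b"\n", pos)
        if nl == -1:
            return []
        pos = nl + 1
    lines = []
    # Read every line that starts at or before `end`, even if it finishes
    # beyond the split boundary.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            lines.append(data[pos:])  # last line without a trailing line feed
            pos = len(data)
        else:
            lines.append(data[pos:nl])
            pos = nl + 1
    return lines


if __name__ == "__main__":
    data = b"Hello Hadoop\nWordCount works\n"
    # An 8-byte split size cuts "Hadoop" into "Ha" and "doop".
    for start, length in make_splits(len(data), 8):
        print((start, length), read_split(data, start, length))
```

With an 8-byte split size the first boundary falls between "Ha" and "doop", yet the first reader returns the whole line "Hello Hadoop" and the second reader skips the leftover "doop" fragment, so a word count over the lines sees each word exactly once.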
