Hadoop MapReduce TextInputFormat - how file split is done

Question

Based on my understanding, TextInputFormat should split exactly at line breaks, but judging by some answers I have seen on this site, I seem to be wrong. Does anyone have a better explanation, and which option is right?

Which of the following best describes the workings of TextInputFormat?

1. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.
2. The input file is split exactly at the line breaks, so each RecordReader will read a series of complete lines.
3. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken line.
4. Input file splits may cross line breaks. A line that crosses file splits is ignored.
5. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReaders of both splits containing the broken line.

Ramzy · Accepted Answer

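Option 3 is correct: a line that crosses a split boundary is read by the RecordReader of the split that contains the beginning of that line. The last line of the first split is read as part of the first split, even though finishing it may require a remote read and so costs some data locality.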

It's not always possible for the end of a line to coincide with the split boundary, because splits are computed from byte offsets rather than from the file's contents.
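To make the rule concrete, here is a small sketch of how a line-oriented reader can decide which lines belong to a split. It is a simplified model of the behaviour described above, not Hadoop code: `read_split` is a hypothetical helper, the file is an in-memory byte string, and it assumes plain `\n` delimiters with no compression. The idea is that every split except the first skips its first (possibly partial) line, and every split keeps reading as long as a line starts at or before the split's end offset, even if that line runs past the boundary.

```python
def read_split(data: bytes, start: int, length: int) -> list[bytes]:
    """Return the lines owned by the byte range [start, start + length).

    Hypothetical helper for illustration only; not part of any Hadoop API.
    """
    end = start + length
    pos = start

    # Any split other than the first skips up to and including the first
    # newline it sees: that line is the responsibility of the previous split.
    if start != 0:
        nl = data.find(b"\n", start)
        pos = len(data) if nl == -1 else nl + 1

    records = []
    # Read every line that starts at or before `end`. The last such line
    # may extend past the boundary, which is the "remote read" case.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        line_end = len(data) if nl == -1 else nl
        records.append(data[pos:line_end])
        pos = line_end + 1
    return records


# Sample input: 26 bytes, cut into splits of 10 bytes.
# "bravo" occupies bytes 6..11, so it crosses the boundary at offset 10.
DATA = b"alpha\nbravo\ncharlie\ndelta\n"
```

With 10-byte splits over `DATA`, `read_split(DATA, 0, 10)` yields `alpha` and `bravo`: the first split reads `bravo` to completion because the line begins inside it. `read_split(DATA, 10, 10)` skips the tail of `bravo` and yields `charlie` and `delta`, and `read_split(DATA, 20, 6)` yields nothing, since the only line beginning after offset 20 does not exist. Every line is read exactly once.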
