Reputation: 101
As I understand it, TextInputFormat should split the input exactly at line breaks, but some answers I have seen on this website suggest I am wrong. Does anyone have a better explanation, and which of the options below is right?
Which of the following best describes the workings of TextInputFormat?
1. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.
2. The input file is split exactly at the line breaks, so each RecordReader will read a series of complete lines.
3. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken line.
4. Input file splits may cross line breaks. A line that crosses file splits is ignored.
5. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReaders of both splits containing the broken line.
Upvotes: 1
Views: 1892
Reputation: 38950
Have a look at the documentation page of TextInputFormat:
An InputFormat for plain text files. Files are broken into lines. Either linefeed or carriage-return are used to signal end of line. Keys are the position in the file, and values are the line of text.
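To make the key/value description concrete, here is a small sketch that mimics it outside Hadoop. It is an illustration only: the class OffsetDemo and the method records are hypothetical names, and it assumes the whole input is in memory with '\n' line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only, not Hadoop code: mimics the (key, value) pairs
// described in the documentation for '\n'-terminated text.
class OffsetDemo {

    // Maps the byte offset at which each line starts to the line's text.
    static Map<Long, String> records(byte[] data) {
        Map<Long, String> out = new LinkedHashMap<>();
        int pos = 0;
        while (pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            out.put((long) pos,
                    new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1; // step over the newline
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(records(data)); // {0=alpha, 6=bravo, 12=charlie}
    }
}
```

The keys are byte offsets, not line numbers, which is why the second record has key 6 rather than 1.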
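Check the implementation of TextInputFormat on grepcode. Its record reader is LineRecordReader, whose nextKeyValue() starts reading any line that begins before the end of the split and finishes it even if it runs past that end. So the third option (the split that contains the beginning of the broken line reads it) seems to be the right one.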
// TextInputFormat: every split is read by a LineRecordReader.
@Override
public RecordReader<LongWritable, Text>
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {
  return new LineRecordReader();
}

// LineRecordReader: produces the next (byte offset, line) pair.
public boolean nextKeyValue() throws IOException {
  if (key == null) {
    key = new LongWritable();
  }
  key.set(pos);
  if (value == null) {
    value = new Text();
  }
  int newSize = 0;
  // A line is read as long as it starts before the split's end;
  // readLine may run past 'end' to finish that line.
  while (pos < end) {
    newSize = in.readLine(value, maxLineLength,
                          Math.max((int)Math.min(Integer.MAX_VALUE, end-pos),
                                   maxLineLength));
    if (newSize == 0) {
      break;
    }
    pos += newSize;
    if (newSize < maxLineLength) {
      break;
    }
    // line too long. try again
    LOG.info("Skipped line of size " + newSize + " at pos " +
             (pos - newSize));
  }
  if (newSize == 0) {
    // nothing left to read in this split
    key = null;
    value = null;
    return false;
  } else {
    return true;
  }
}
Upvotes: 0
Reputation: 7148
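The third option is correct: a line that crosses splits is read by the RecordReader of the split that contains the beginning of the broken line. The last line of the first split is read in full as part of the first split, even though this may require a remote read and so hurts data locality.
It's not always possible for the end of a line to coincide with the split boundary.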
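The rule can be tried out with a small simulation. This is a simplified sketch, not Hadoop's actual LineRecordReader: SplitDemo and readSplit are hypothetical names, the file is assumed to be in memory with '\n' line endings, and a split is just a byte range [start, end). A reader that does not start at byte 0 backs up one byte and discards everything through the next newline, and every reader finishes any line that starts before its end.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Simplified simulation, NOT Hadoop's actual LineRecordReader.
// Assumes '\n' line endings and the whole file held in memory.
class SplitDemo {

    // Returns the lines a reader assigned to bytes [start, end) would emit.
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {
            // Back up one byte and discard everything through the next '\n'.
            // If the previous byte is '\n', a line begins exactly at 'start'
            // and is kept by this split.
            pos = start - 1;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step over the newline
        }
        List<String> lines = new ArrayList<>();
        // A line is emitted if it STARTS before 'end', even if it ends past it.
        while (pos < end && pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            lines.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes(StandardCharsets.UTF_8);
        // The boundary at byte 8 falls inside "bravo" (bytes 6 to 10).
        System.out.println(readSplit(data, 0, 8));           // [alpha, bravo]
        System.out.println(readSplit(data, 8, data.length)); // [charlie]
    }
}
```

Here "bravo" begins in the first split and ends in the second, and only the first reader emits it, so every line is read exactly once.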
Upvotes: 4