Raj

Reputation: 101

Hadoop MapReduce TextInputFormat - how file split is done

Based on my understanding, TextInputFormat should split exactly at line breaks, but some answers I have seen on this site suggest I am wrong. Can anyone explain how the splitting actually works and which option is right?

Which of the following best describes the workings of TextInputFormat?

  1. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.

  2. The input file is split exactly at the line breaks, so each Record Reader will read a series of complete lines.

  3. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken line.

  4. Input file splits may cross line breaks. A line that crosses file splits is ignored.

  5. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReaders of both splits containing the broken line.

Upvotes: 1

Views: 1892

Answers (2)

Ravindra babu

Reputation: 38950

Have a look at the documentation page for TextInputFormat:

An InputFormat for plain text files. Files are broken into lines. Either linefeed or carriage-return are used to signal end of line. Keys are the position in the file, and values are the line of text.
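To make the key/value description concrete, here is a minimal sketch in plain Java (no Hadoop dependency) that lists the pairs a reader would produce for a small input. The class name OffsetKeys and the method records are hypothetical, and the sketch assumes UTF-8 text with "\n" line endings only.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not Hadoop code: maps each line's starting
// byte offset (the key) to the line's text (the value).
class OffsetKeys {

    static Map<Long, String> records(String content) {
        Map<Long, String> out = new LinkedHashMap<>();
        long offset = 0;
        // Assumption: lines end with '\n' only; trailing empty strings are dropped by split().
        for (String line : content.split("\n")) {
            out.put(offset, line);
            // Advance by the line's byte length plus one byte for the newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(records("hello\nhi\nworld\n"));
    }
}
```

For the input above, the keys are 0, 6 and 9: "hello" plus its newline occupies bytes 0 to 5, and "hi" plus its newline occupies bytes 6 to 8.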

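Check the implementation of TextInputFormat on grepcode. (Option 3 matches the code: the read loop starts a line only while pos < end and can read past end to finish it, so a line that crosses a split boundary is read by the split in which it begins.)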

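  @Override
  public RecordReader<LongWritable, Text>
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {
    // Every split is handed to a LineRecordReader; the handling of
    // lines that cross split boundaries lives in that class.
    return new LineRecordReader();
  }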

LineRecordReader:

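  public boolean nextKeyValue() throws IOException {
    if (key == null) {
      key = new LongWritable();
    }
    // The key is the byte offset at which the next line starts.
    key.set(pos);
    if (value == null) {
      value = new Text();
    }
    int newSize = 0;
    // A new line is started only while pos is still inside the split.
    while (pos < end) {
      // The consume limit (third argument) is at least maxLineLength, so
      // readLine can run past `end` to finish a line that began in this split.
      newSize = in.readLine(value, maxLineLength,
                            Math.max((int)Math.min(Integer.MAX_VALUE, end-pos),
                                     maxLineLength));
      if (newSize == 0) {
        // Nothing was read: end of input.
        break;
      }
      pos += newSize;
      if (newSize < maxLineLength) {
        // The line fit within the limit: emit it.
        break;
      }

      // line too long. try again
      LOG.info("Skipped line of size " + newSize + " at pos " + 
               (pos - newSize));
    }
    if (newSize == 0) {
      key = null;
      value = null;
      return false;
    } else {
      return true;
    }
  }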
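The effect of that loop across several splits can be sketched with a small Hadoop-free simulation. Everything here (the class SplitSimulation, the helpers endOfLine and readSplit) is hypothetical. It also assumes that a reader whose split does not start at offset 0 backs up one byte and discards everything through the next newline before reading; that step belongs to the reader's initialization, which is not quoted above, so treat it as an assumption rather than the actual implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation, not Hadoop code: shows which lines a reader
// assigned the byte range [start, start + length) would emit.
class SplitSimulation {

    // Returns the position just past the next '\n' at or after `from`
    // (or the end of the data if there is no further newline).
    static int endOfLine(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }

    static List<String> readSplit(byte[] data, int start, int length) {
        int end = Math.min(start + length, data.length);
        int pos = start;
        // Assumption: a non-first split backs up one byte and skips through the
        // next newline, because the previous split's reader owns that line.
        if (start != 0) {
            pos = endOfLine(data, start - 1);
        }
        List<String> lines = new ArrayList<>();
        // Same condition as the quoted loop: start a line only while pos < end,
        // but read that line to its end even if it runs past the split.
        while (pos < end) {
            int next = endOfLine(data, pos);
            int textEnd = (data[next - 1] == '\n') ? next - 1 : next;
            lines.add(new String(data, pos, textEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes(StandardCharsets.UTF_8);
        int splitSize = 8; // deliberately cuts through "bravo" and "charlie"
        for (int start = 0; start < data.length; start += splitSize) {
            System.out.println("split at " + start + ": " + readSplit(data, start, splitSize));
        }
    }
}
```

With 8-byte splits over this 20-byte input, the first split yields [alpha, bravo], the second yields [charlie], and the third yields nothing. Each line is read exactly once, by the split in which it begins.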

Upvotes: 0

Ramzy

Reputation: 7148

Option 3 is correct. A line that begins in the first split is read in full by the first split's RecordReader, even though the part beyond the split boundary may require a remote read and so loses data locality.

It's not always possible for the end of a line to coincide with the split boundary, because splits are computed from byte offsets without inspecting the file contents.

Upvotes: 4
