Reputation: 387
I read this in the Apache Hadoop documentation:
InputSplit represents the data to be processed by an individual Mapper.
Typically, it presents a byte-oriented view on the input and is the responsibility of RecordReader of the job to process this and present a record-oriented view.
Link - https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/mapred/InputSplit.html
Can somebody explain the difference between byte-oriented view and record-oriented view?
Upvotes: 3
Views: 1350
Reputation: 8091
HDFS stores a file as blocks cut purely by byte count (the byte-oriented view), so every block is at most the configured block size. The cut does not follow any logical record boundary, which means the first part of a record may sit at the end of one block and the rest at the start of the next. That is fine for storage, but at processing time a partial record cannot be processed as it is. This is where the record-oriented view comes in: the job's RecordReader takes the byte range described by the InputSplit and turns it into complete records, reading the remainder of the last record from the next block when needed. So the InputSplit describes the input as a range of bytes, and the RecordReader presents that range to the Mapper as whole records (the record-oriented view).
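To make the two views concrete, here is a small Python sketch. It is only an illustration, not Hadoop code: `byte_splits` and `read_records` are hypothetical names, the records are assumed to be newline-terminated text, and the boundary rule (skip the first line unless the split starts at byte 0, then keep reading any line that starts at or before the end of the split) is my simplified model of how a line-based reader behaves.

```python
def byte_splits(data: bytes, split_size: int):
    """Byte-oriented view: (start, length) ranges that ignore record boundaries."""
    return [(start, min(split_size, len(data) - start))
            for start in range(0, len(data), split_size)]


def read_records(data: bytes, start: int, length: int):
    """Record-oriented view of one split, assuming newline-terminated records."""
    end = start + length
    pos = start
    if start != 0:
        # Assumption: the reader of the previous split owns the line that
        # crosses (or begins exactly at) this boundary, so skip past it.
        nl = data.find(b"\n", start)
        pos = len(data) if nl == -1 else nl + 1
    records = []
    # Read every line that starts at or before 'end', even if it finishes
    # in the next byte range.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl
        records.append(data[pos:stop].decode())
        pos = stop + 1
    return records


if __name__ == "__main__":
    data = b"alpha\nbravo\ncharlie\ndelta\n"
    for start, length in byte_splits(data, 10):
        raw = data[start:start + length]               # what the split covers
        print(raw, "->", read_records(data, start, length))
```

With a split size of 10 bytes, the first range ends in the middle of `bravo`, yet the first reader still returns `bravo` whole and the second reader skips the leftover `o`. Every record comes out exactly once, even though the byte ranges cut through them.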
Upvotes: 4