datarockz2

Reputation: 44

Hadoop: Why use FileSplit in the RecordReader implementation?

In Hadoop, consider a scenario where a bigfile has already been loaded into the HDFS filesystem using either the hdfs dfs -put or hdfs dfs -copyFromLocal command, so the bigfile has been split into blocks (64 MB each).

In this case, when a custom RecordReader has to be created to read the bigfile, please explain the reason for using FileSplit, when the bigfile was already split during the file loading process and is available in the form of split blocks.

Upvotes: 0

Views: 201

Answers (3)

Remus Rusanu

Reputation: 294207

You are comparing apples and oranges. The full name of FileSplit is org.apache.hadoop.mapred.FileSplit, emphasis on mapred. It is a MapReduce concept, not a file system one. FileSplit is simply a specialization of an InputSplit:

InputSplit represents the data to be processed by an individual Mapper.

You are unnecessarily adding HDFS concepts like blocks to the discussion. MapReduce is not tied to HDFS (they have synergy together, true). MapReduce can run on many other file systems, like the raw local file system, S3, Azure blobs, etc.

Whether a FileSplit happens to coincide with an HDFS block is pure coincidence, from your point of view (it is not really a coincidence, since the job is split to take advantage of HDFS block locality, but that is a detail).
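To make that locality point concrete, here is a minimal sketch of how the split size gets chosen. It mirrors the logic of FileInputFormat.computeSplitSize() in the org.apache.hadoop.mapreduce API; the SplitSizeSketch class itself is just an illustration, not part of Hadoop. With the default min/max split sizes, the split size equals the block size, which is why splits and blocks usually line up.

    public class SplitSizeSketch {
        // Mirrors FileInputFormat.computeSplitSize(blockSize, minSize, maxSize).
        static long computeSplitSize(long blockSize, long minSize, long maxSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) {
            long blockSize = 64L * 1024 * 1024; // block size reported by the file system
            long minSize   = 1L;                // lower bound (mapreduce.input.fileinputformat.split.minsize)
            long maxSize   = Long.MAX_VALUE;    // upper bound (mapreduce.input.fileinputformat.split.maxsize)

            // With the defaults, splitSize == blockSize, so each FileSplit
            // covers exactly one block of the input file.
            long splitSize = computeSplitSize(blockSize, minSize, maxSize);
            System.out.println("split size = " + splitSize + " bytes");
        }
    }

A job can shrink or grow the splits through those two properties, in which case a FileSplit no longer matches an HDFS block at all, which is the point: the split is a MapReduce scheduling unit, not a storage unit.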

Upvotes: 0

Mike Park

Reputation: 10931

Please explain the reason for using FileSplit, when the bigfile was already split during the file loading process and is available in the form of split blocks.

I think you might be confused about what a FileSplit actually is. Let's say your bigfile is 128MB and your block size is 64MB. bigfile will take up two blocks. You know this already. You will also (usually) get two FileSplits when the file is being processed in MapReduce. Each FileSplit maps to a block as it was previously loaded.

Keep in mind that the FileSplit class does not contain any of the file's actual data. It is simply a pointer to data within the file.
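To see that for yourself, here is a small sketch (assuming the newer org.apache.hadoop.mapreduce API; FileSplitInspector is just an illustrative name) that dumps what a FileSplit actually carries: only metadata about a byte range, never the bytes themselves.

    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // A FileSplit is only a pointer: which file, which byte offset, how many
    // bytes, and which hosts hold that range. It contains none of the data.
    public class FileSplitInspector {
        public static void describe(FileSplit split) throws Exception {
            System.out.println("file   : " + split.getPath());    // file to open
            System.out.println("start  : " + split.getStart());   // first byte this reader owns
            System.out.println("length : " + split.getLength());  // how many bytes to process
            System.out.println("hosts  : " + String.join(",", split.getLocations())); // where that range lives
        }
    }

Your custom RecordReader reads those fields in initialize(), opens the file itself, seeks to getStart(), and processes getLength() bytes.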

Upvotes: 1

Abraham

Reputation: 443

HDFS splits files into blocks for storage purposes and may spread a single file's data across multiple blocks, depending on the actual file size. So if you are writing a custom RecordReader, you will have to tell your record reader where to start and stop reading within the split so that you process complete records. Reading strictly from the beginning of each block, or stopping your read exactly at the end of each block, may give your mapper incomplete records. A sketch of how to handle the boundaries follows below.
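As a rough illustration of that idea (modelled on what Hadoop's own LineRecordReader does for newline-delimited text; the class and field names here are illustrative, not part of the Hadoop API), a custom RecordReader can use the FileSplit boundaries like this: skip the partial record at the start of the split, and read one record past the end so that the record the next reader skips is not lost.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.util.LineReader;

    // Sketch of a boundary-aware RecordReader for newline-delimited records.
    public class BoundaryAwareRecordReader extends RecordReader<LongWritable, Text> {
        private long start, end, pos;
        private FSDataInputStream in;
        private LineReader reader;
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit genericSplit, TaskAttemptContext context)
                throws IOException {
            FileSplit split = (FileSplit) genericSplit;
            Configuration conf = context.getConfiguration();
            start = split.getStart();               // first byte this task owns
            end = start + split.getLength();        // first byte it does NOT own
            Path file = split.getPath();
            in = file.getFileSystem(conf).open(file);
            in.seek(start);
            reader = new LineReader(in, conf);
            // If we start mid-file we are almost certainly mid-record: skip the
            // partial line, it belongs to the previous split's reader.
            if (start != 0) {
                start += reader.readLine(new Text());
            }
            pos = start;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            // Only read a record whose first byte lies within this split; the
            // last such record may run past `end`, which is exactly the record
            // the next split's reader skipped in initialize().
            if (pos > end) {
                return false;
            }
            key.set(pos);
            int bytesRead = reader.readLine(value);
            if (bytesRead == 0) {
                return false;                        // end of file
            }
            pos += bytesRead;
            return true;
        }

        @Override public LongWritable getCurrentKey() { return key; }
        @Override public Text getCurrentValue() { return value; }
        @Override public float getProgress() {
            return end == start ? 0f : Math.min(1f, (pos - start) / (float) (end - start));
        }
        @Override public void close() throws IOException {
            if (reader != null) reader.close();
        }
    }

Because every reader skips its leading partial record and finishes the record it started at its boundary, each record in the file is processed by exactly one mapper even though HDFS cut the file into blocks without regard for record boundaries.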

Upvotes: 0
