Reputation: 21
I have a very important question because I must give a presentation about MapReduce. My question is:
I have read that in MapReduce the input file is divided into blocks, and every block is replicated on 3 different nodes. A block can be 128 MB. Is this block the input to a mapper? I mean, will this 128 MB block be split further into parts, with every part going to a single map? If yes, into which size will this 128 MB be divided? Or does the file break into blocks, and those blocks themselves are the input for a mapper? I'm a little bit confused.
Could you look at the photo and tell me which one is right?
Here the HDFS file is divided into blocks, and every single 128 MB block is the input for one map.
Upvotes: 2
Views: 557
Reputation: 419
HDFS stores the file as blocks, and each block is 128 MB in size by default. MapReduce processes this HDFS file: each mapper processes one block (one input split). So, to answer your question, 128 MB is a single block size, and it will not be split further.
Note: the input split used in the MapReduce context is a logical split, whereas the split in HDFS is a physical split.
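For reference, Hadoop's FileInputFormat derives the logical split size from the block size and the user-configurable minimum/maximum split sizes via max(minSize, min(maxSize, blockSize)). A minimal standalone sketch of that formula (no Hadoop dependency; the class and method names here are my own, not Hadoop's):

```java
// Sketch of the split-size formula used by Hadoop's FileInputFormat:
// max(minSize, min(maxSize, blockSize)). Names are illustrative only.
public class SplitSize {
    // Defaults mirror mapreduce.input.fileinputformat.split.minsize = 1
    // and split.maxsize = Long.MAX_VALUE.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long block = 128L << 20; // 128 MB HDFS block
        // With default min/max, the split equals the block: one mapper per block.
        System.out.println(computeSplitSize(block, 1L, Long.MAX_VALUE) >> 20); // 128
        // Capping the max split at 100 MB yields splits smaller than the block.
        System.out.println(computeSplitSize(block, 1L, 100L << 20) >> 20); // 100
    }
}
```

So with the default settings, split size = block size, which is why each mapper gets exactly one 128 MB block.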
Upvotes: 1
Reputation: 858
Let's say you have a 2 GB file that you want to place in HDFS. There will be 2 GB / 128 MB = 16 blocks, and these blocks will be distributed across the different DataNodes.
Data splitting happens based on file offsets. The goal of splitting the file and storing it in different blocks is parallel processing and failover of data.
A split is a logical split of the data, used during processing with a Map/Reduce program or other data-processing techniques in Hadoop. The split size is a user-defined value, and you can choose your own split size based on the volume of data you are processing.
The split is basically used to control the number of mappers in a Map/Reduce program. If you have not defined an input split size in the Map/Reduce program, the default HDFS block size is taken as the input split (i.e. input split = input block, so 16 mappers will be launched for the 2 GB file). If the split size is defined as, say, 100 MB, then 21 mappers will be launched: 20 mappers for 2000 MB and a 21st mapper for the remaining 48 MB.
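The mapper counts above follow from a simple ceiling division of file size by split size. A small sketch to check the arithmetic (a simplification: it ignores the slop factor Hadoop applies, which lets the final split grow slightly past the split size instead of spawning a tiny extra mapper):

```java
// Back-of-the-envelope mapper count: ceil(fileSize / splitSize).
// Simplified: ignores Hadoop's slop allowance on the final split.
public class MapperCount {
    static long countMappers(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long twoGB = 2048L << 20; // 2 GB = 2048 MB
        // Default: split = block = 128 MB -> 16 mappers for the 2 GB file.
        System.out.println(countMappers(twoGB, 128L << 20)); // 16
        // User-defined 100 MB split -> 20 full splits + one 48 MB remainder.
        System.out.println(countMappers(twoGB, 100L << 20)); // 21
    }
}
```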
Hope this clears your doubt.
Upvotes: 1