Reputation: 6465
Assume a map-reduce job with m mappers which is fed by an input file F. Apparently the MapReduce framework splits F into chunks (64 MB by default) and feeds each chunk to a mapper. My question is: if I run this MapReduce job several times, are the chunks formed the same way each time? That is, do the points at which the MapReduce framework splits F remain the same, or may they differ?
As an example, assume F contains the following lines:
1,2
3,5
5,6
7,6
5,5
7,7
In the first run, MapReduce forms two chunks as follows:
Chunk 1:
1,2
3,5
5,6
Chunk 2:
7,6
5,5
7,7
My question is whether the way the split is done remains the same if I run it again.
Besides, does each chunk have a unique name that can be used in the mapper?
Upvotes: 0
Views: 1316
Reputation: 34184
My question is whether the way the split is done remains the same if I run it again.
It is true that the input data gets split into chunks first and that each of these chunks is fed to a mapper, but the chunk size is not always 64 MB. Perhaps you have confused the HDFS block (usually 64 MB) with the MR split; the two are totally different things, although it is possible for your split size and block size to be the same.
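For context, with the commonly used FileInputFormat the split size is derived from the HDFS block size together with the configured minimum and maximum split sizes, which is why the two values often coincide. A minimal sketch of that relationship (the formula mirrors FileInputFormat's computeSplitSize; the concrete numbers are only examples):
// Sketch of how FileInputFormat-style split sizing works:
// splitSize = max(minSplitSize, min(maxSplitSize, blockSize))
long blockSize    = 64L * 1024 * 1024;  // HDFS block size, e.g. 64 MB
long minSplitSize = 1L;                 // mapreduce.input.fileinputformat.split.minsize
long maxSplitSize = Long.MAX_VALUE;     // mapreduce.input.fileinputformat.split.maxsize
long splitSize = Math.max(minSplitSize, Math.min(maxSplitSize, blockSize));
// With these defaults splitSize equals blockSize, so splits line up with HDFS blocks.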
Coming to your actual question: yes, it is the same for all jobs that use the same InputFormat. The reason is that creating the splits is the job of the InputFormat you are using. To be precise, the logic inside getSplits(JobContext context) of your InputFormat governs split creation. So, if it is the same in all the jobs, split creation will also be the same.
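Because split creation is driven entirely by the InputFormat and the job configuration, you can also pin the split size down explicitly in the driver. A small sketch, assuming the new org.apache.hadoop.mapreduce API and TextInputFormat (the class name and values are illustrative):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitConfigDriver {
    public static void main(String[] args) throws Exception {
        // With the same input file, the same InputFormat and the same split
        // settings, getSplits() produces the same splits on every run.
        Job job = Job.getInstance();
        job.setInputFormatClass(TextInputFormat.class);
        // Fix both bounds to 64 MB so the split size cannot vary.
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    }
}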
Besides, does each chunk have a unique name that can be used in the mapper?
Each chunk (split) has two things: a length in bytes and a set of storage locations (hostname strings). It does not carry a unique name of its own, but for file-based input you can get at the underlying file name, as shown below.
Edit:
How to get the name of the file being processed by the mapper:
// For file-based input formats, the mapper's input split is a FileSplit,
// so it can be cast to obtain the path of the underlying file.
FileSplit fileSplit = (FileSplit) context.getInputSplit();
String filename = fileSplit.getPath().getName();
Now you can open an FSDataInputStream on this file and read its contents.
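For completeness, here is a minimal sketch of a mapper that picks up the file name in setup() and reuses it in map(); the class and the emitted key/value are illustrative, not taken from your job:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileNameMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String filename;

    @Override
    protected void setup(Context context) {
        // The InputSplit handed to this mapper says which file (and byte range) it covers.
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        filename = fileSplit.getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit the file name alongside each record, just to show it is available here.
        context.write(new Text(filename), value);
    }
}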
Hope it answers your query.
Upvotes: 1