user3101883

Reputation: 19

Do part files in mapper output correspond to the input splits?

Do the part files generated as the output of a mapper-only job (part-m-00000, part-m-00001, and so on) represent the first input split, the second input split, and so on? And are they generated sequentially?

Upvotes: 2

Views: 357

Answers (1)

Joydip Datta

Reputation: 449

Not necessarily. The split array returned by the getSplits() method is sorted by size, so that the biggest splits go first. This sorted array is passed further down, and a map task is created for each element. So the original ordering of the splits is lost in the sort, and part-m-00000 need not correspond to the first input split.
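To illustrate the effect of that sort, here is a minimal sketch in plain Java. It does not use Hadoop; the class SplitOrderSketch, the method taskToSplit and the split lengths are made up for illustration, and it assumes task ids are handed out in the order of the sorted array, as described above.

```java
import java.util.Arrays;

public class SplitOrderSketch {

    // Hypothetical helper: given the split lengths in their original order,
    // return an array where result[taskId] is the index of the original split
    // that the map task with that id would process.
    static int[] taskToSplit(long[] splitLengths) {
        Integer[] order = new Integer[splitLengths.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort split indices by length, biggest first (assumed to mirror the sort in writeSplits).
        Arrays.sort(order, (a, b) -> Long.compare(splitLengths[b], splitLengths[a]));

        int[] result = new int[order.length];
        for (int taskId = 0; taskId < order.length; taskId++) {
            result[taskId] = order[taskId];
        }
        return result;
    }

    public static void main(String[] args) {
        long[] lengths = {64L, 128L, 32L}; // made-up sizes for splits 0, 1 and 2
        int[] mapping = taskToSplit(lengths);
        for (int taskId = 0; taskId < mapping.length; taskId++) {
            System.out.printf("part-m-%05d <- input split %d (%d bytes)%n",
                    taskId, mapping[taskId], lengths[mapping[taskId]]);
        }
    }
}
```

With these made-up sizes, part-m-00000 would hold the output for split 1 (the largest), not split 0.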

Reference: org.apache.hadoop.mapreduce.JobSubmitter class. See method writeSplits(..)

Link to source code: https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/JobSubmitter.java


Further reading on how the file names are decided:

Once the task id is determined, the name of the file is decided by the getDefaultWorkFile API in the org.apache.hadoop.mapreduce.lib.output.FileOutputFormat class. Here is the documentation:

getDefaultWorkFile

public Path getDefaultWorkFile(TaskAttemptContext context,
                               String extension)
                        throws IOException
Get the default path and filename for the output format.
Parameters:
context - the task context
extension - an extension to add to the filename
Returns:
a full path $output/_temporary/$taskid/part-[mr]-$id

This means "part" is followed by the task type ('m' for maps, 'r' for reduces) and the task partition number (i.e. the task id). For example, the file generated for the first map task of the job will be named 'part-m-00000'.
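The naming pattern can be sketched in plain Java as follows. This is only an approximation of the part-[mr]-$id pattern from the Javadoc, not Hadoop's actual code; PartNameSketch and partName are hypothetical names, and the five-digit zero padding is inferred from the 'part-m-00000' example.

```java
import java.util.Locale;

public class PartNameSketch {

    // Hypothetical helper that builds a name of the form part-[mr]-$id plus extension.
    // isMap selects 'm' or 'r'; partition is the task partition number.
    static String partName(boolean isMap, int partition, String extension) {
        String taskType = isMap ? "m" : "r";
        // Zero-pad the partition number to five digits, e.g. 0 -> 00000.
        return String.format(Locale.ROOT, "part-%s-%05d%s", taskType, partition, extension);
    }

    public static void main(String[] args) {
        System.out.println(partName(true, 0, ""));      // part-m-00000
        System.out.println(partName(false, 12, ".gz")); // part-r-00012.gz
    }
}
```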

Javadoc reference: https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html#getDefaultWorkFile(org.apache.hadoop.mapreduce.TaskAttemptContext, java.lang.String)

The older FileOutputFormat API in the org.apache.hadoop.mapred package works in a similar way. Here is the reference: https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapred/FileOutputFormat.html#getUniqueName(org.apache.hadoop.mapred.JobConf, java.lang.String)

Upvotes: 1
