abbasdjinn

Reputation: 157

How does the JobClient in Hadoop compute InputSplits?

I am trying to get insight into the MapReduce architecture. I am consulting this http://answers.oreilly.com/topic/2141-how-mapreduce-works-with-hadoop/ article. I have some questions regarding the JobClient component of the MapReduce framework. My question is:

How does the JobClient compute the input splits on the data?

According to the material I am consulting, the JobClient computes input splits on the data located in the HDFS input path specified when running the job. The article then says that the JobClient copies the resources (jars and computed input splits) to HDFS. Here is my question: when the input data is already in HDFS, why does the JobClient copy the computed input splits into HDFS?

Let's assume that the JobClient copies the input splits to HDFS. Now, when the job is submitted to the JobTracker and the JobTracker initializes the job, why does it retrieve the input splits from HDFS?

Apologies if my question is not clear. I am a beginner. :)

Upvotes: 1

Views: 2572

Answers (2)

Niranjan Sarvi

Reputation: 899

The computation of the input split depends on the InputFormat. For a typical text input format, the generic formula for the split size is

  max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size))

or, with the default configuration,

input split size = dfs.block.size (because by default mapred.min.split.size < dfs.block.size < mapred.max.split.size)

where

mapred.min.split.size = minimum split size
mapred.max.split.size = maximum split size
dfs.block.size = DFS block size
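
For illustration, here is a minimal Java sketch of that formula. The property names follow the old mapred.* API quoted above; the default values in main are my assumptions, not something stated in this answer:

    // Minimal sketch of FileInputFormat-style split sizing.
    public class SplitSizeSketch {

        // Same shape as the formula above:
        // max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size))
        static long computeSplitSize(long minSplitSize, long maxSplitSize, long blockSize) {
            return Math.max(minSplitSize, Math.min(maxSplitSize, blockSize));
        }

        public static void main(String[] args) {
            long minSplitSize = 1L;                 // assumed default for mapred.min.split.size
            long maxSplitSize = Long.MAX_VALUE;     // assumed default for mapred.max.split.size
            long blockSize    = 64L * 1024 * 1024;  // e.g. a 64 MB dfs.block.size

            // With these values min < block < max, so the split size equals the block size.
            System.out.println(computeSplitSize(minSplitSize, maxSplitSize, blockSize));
        }
    }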

For DBInputFormat, the split size is roughly
(total records / number of mappers)
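
A similarly rough sketch for the DB case; the class and method names here are illustrative, not the real DBInputFormat internals:

    // Illustrative only: each configured map task gets roughly totalRecords / numMappers rows.
    public class DbSplitSketch {
        static long rowsPerSplit(long totalRecords, int numMappers) {
            // A real implementation also deals with the remainder (e.g. the last split takes it).
            return totalRecords / numMappers;
        }

        public static void main(String[] args) {
            // e.g. 1,000,000 rows spread over 4 map tasks -> 250,000 rows per split
            System.out.println(rowsPerSplit(1_000_000L, 4));
        }
    }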

That said, the number of input splits and their sizes are the meta information handed to the mapper tasks and record readers.

Upvotes: 0

Thomas Jungblut

Reputation: 20969

No, the JobClient does not copy the input splits to HDFS. You have quoted the answer yourself:

The JobClient computes input splits on the data located in the HDFS input path specified when running the job. The article then says that the JobClient copies the resources (jars and computed input splits) to HDFS.

The input data itself stays in the cluster (in HDFS). The client only computes on the meta information it gets from the namenode (block size, data length, block locations). These computed input splits carry meta information to the tasks, e.g. the block offset and the length to compute on.
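
To make that concrete, here is a small sketch of the kind of metadata a client can ask the namenode for without reading any data. This is my own illustration, not the JobClient code, and the input path is a placeholder:

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockMetaSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Placeholder input path
            FileStatus status = fs.getFileStatus(new Path("/user/hadoop/input/data.txt"));
            long length = status.getLen();          // data length
            long blockSize = status.getBlockSize(); // block size
            System.out.println("length=" + length + " blockSize=" + blockSize);

            // One entry per block: offset, length and the datanodes holding it.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, length)) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + Arrays.toString(block.getHosts()));
            }
        }
    }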

Have a look at org.apache.hadoop.mapreduce.lib.input.FileSplit: it contains the file path, the start offset and the length of the chunk a single task will operate on as its input. The serializable class you may also want to look at is org.apache.hadoop.mapreduce.split.JobSplit.SplitMetaInfo.

This meta information will be computed for each task that will be run, and copied with the jars to the node that will actually execute this task.
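
If you want to see that split metadata from inside a task, here is a rough sketch, assuming an input format that produces FileSplits (e.g. TextInputFormat); the class name is mine:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class SplitAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Only metadata: which file, where in it this task starts, and how many bytes it covers.
            FileSplit split = (FileSplit) context.getInputSplit();
            System.out.println("file=" + split.getPath()
                    + " start=" + split.getStart()
                    + " length=" + split.getLength());
        }
    }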

Upvotes: 0
