scott

Reputation: 235

What does the MapReduce framework write to the split metainfo file?

I am getting the following error for a mapreduce job:

Job initialization failed: java.io.IOException: Split metadata size exceeded 10000000. Aborting job job_201511121020_1680
    at org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.readSplitMetaInfo(SplitMetaInfoReader.java:48)
    at org.apache.hadoop.mapred.JobInProgress.createSplits(JobInProgress.java:828)
    at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:730)
    at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:3775)
    at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:90)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

The input path to this job is: /dir1/dir2///year/mon/day ... (7 days)

Here is what I gathered from research: this error occurs because the split meta info size exceeds the limit (set by mapreduce.job.split.metainfo.maxsize). I am assuming this metadata is written to a file and it's the size of that file that has exceeded the limit.

I have a few more questions:

  1. Does the framework create one file or multiple files per job?
  2. What are the contents of this file? The input path is deep, but when I write the names of all files returned by the input path to a file, its size is only a few MBytes.

Any help in better understanding this error is appreciated.

Upvotes: 1

Views: 934

Answers (1)

Manjunath Ballur

Reputation: 6343

By default, the maximum size of the split meta information is 10000000 bytes:

public static final long DEFAULT_SPLIT_METAINFO_MAXSIZE = 10000000L;

You can override it by setting the configuration parameter mapreduce.job.split.metainfo.maxsize in mapred-site.xml.
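
As a rough illustration of that override, a property entry might look like the fragment below. The value 50000000 is an arbitrary example chosen for this sketch, not a recommendation; pick a limit that fits the number of splits your jobs actually produce.

```xml
<!-- mapred-site.xml: the value below is an illustrative assumption -->
<property>
  <name>mapreduce.job.split.metainfo.maxsize</name>
  <value>50000000</value>
</property>
```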

Now coming to your questions:

  1. One split meta info file is created per job. It is stored in the job's .staging folder as job.splitmetainfo, next to job.split, which holds the serialized splits themselves.

  2. The contents of the meta info file are:

    1) Split file header: "META-SPL"

    2) Split file version: 1

    3) Number of splits

    4) Information about each split:
       a) Locations of the split: the hosts that hold its data (a split can be present in 3 locations if the replication factor is 3)
       b) Start offset: where the split's serialized data begins in job.split
       c) Length of the split
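
The layout above also suggests why the file can exceed the limit even when the list of input files is small: its size grows with the number of splits, and every split repeats its host names. The following is a minimal sketch of that arithmetic, not Hadoop's actual encoding. It writes fixed-width integers and writeUTF strings with plain java.io, whereas Hadoop uses its own variable-length formats, so real sizes will differ. The class name, method name, and host name are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: approximates the size of a split meta info file.
class SplitMetaInfoEstimate {
    // Limit quoted in the error message and in DEFAULT_SPLIT_METAINFO_MAXSIZE.
    public static final long DEFAULT_MAX_SIZE = 10_000_000L;

    // Writes a simplified version of the layout (header, version, split count,
    // then locations, start offset and length per split) and returns its size.
    public static long estimateBytes(int numSplits, int locationsPerSplit, String sampleHost) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.write("META-SPL".getBytes(StandardCharsets.UTF_8)); // header
            out.writeInt(1);                                        // version
            out.writeInt(numSplits);                                // number of splits
            for (int i = 0; i < numSplits; i++) {
                out.writeInt(locationsPerSplit);                    // location count
                for (int j = 0; j < locationsPerSplit; j++) {
                    out.writeUTF(sampleHost);                       // one host name
                }
                out.writeLong(0L);                                  // start offset (placeholder)
                out.writeLong(0L);                                  // split length (placeholder)
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    public static void main(String[] args) {
        String host = "datanode-001.example.com"; // hypothetical host name
        for (int splits : new int[] {1_000, 100_000, 200_000}) {
            long bytes = estimateBytes(splits, 3, host);
            System.out.printf("%d splits -> about %d bytes (over default limit: %b)%n",
                    splits, bytes, bytes > DEFAULT_MAX_SIZE);
        }
    }
}
```

Under these assumptions each split costs roughly a hundred bytes, so a job over many small files or blocks can cross 10000000 bytes with a few hundred thousand splits. Reducing the split count, for example by combining small inputs, shrinks the file; raising the limit only moves the threshold.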
    

You can find more information about the SplitMetaInfo class in JobSplit.java.

Upvotes: 1
