Reputation: 444
I am using CompositeInputFormat to provide input to a Hadoop job.
The number of splits generated is the total number of files given as input to CompositeInputFormat (for joining).
The job completely ignores the block size and max split size while taking input through CompositeInputFormat. This results in long-running map tasks and slows the system down, since the input files are larger than the block size.
Is anyone aware of a way to control the number of splits for CompositeInputFormat?
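For reference, the join is set up along these lines (the join type, input format, and paths below are placeholders, not my exact configuration):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class JoinSetup {
    public static JobConf configure() {
        JobConf conf = new JobConf(JoinSetup.class);
        // CompositeInputFormat drives the split computation itself: one split
        // per partition of the composed inputs, regardless of dfs.block.size
        // or mapred.max.split.size.
        conf.setInputFormat(CompositeInputFormat.class);
        conf.set("mapred.join.expr",
                CompositeInputFormat.compose("inner",
                        KeyValueTextInputFormat.class,
                        new Path("/data/left"), new Path("/data/right")));
        return conf;
    }
}
```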
Upvotes: 3
Views: 2143
Reputation: 39893
Unfortunately, CompositeInputFormat has to ignore the block/split size. CompositeInputFormat requires the input files to be sorted and partitioned identically, so Hadoop has no way to split a file at an arbitrary offset while preserving that organization.
The only way to get around this is to split and partition the files manually into smaller pieces. You can do this by passing the data through a MapReduce job (probably just an identity mapper and identity reducer) with a larger number of reducers. Just be sure to run both of your data sets through with the same number of reducers.
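A minimal sketch of that pre-pass, assuming the old mapred API and Text key/value records readable by KeyValueTextInputFormat (adjust the input/output formats, types, and reducer count to your data):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class Repartition {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(Repartition.class);
        conf.setJobName("repartition");

        // Identity map and reduce: the records pass through unchanged,
        // but the shuffle re-partitions and re-sorts them by key.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        conf.setInputFormat(KeyValueTextInputFormat.class);
        conf.setOutputFormat(SequenceFileOutputFormat.class);

        // More reducers => more, smaller partitions => more map tasks in the
        // later CompositeInputFormat join. Use the SAME number for every
        // data set that takes part in the join. 64 is just an example value.
        conf.setNumReduceTasks(64);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
```

Running this over each data set with the same reducer count leaves every set sorted and identically partitioned (which is exactly what CompositeInputFormat requires), and the join job then gets one map task per partition instead of one per oversized file.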
Upvotes: 6