Reputation: 526
I have 10 million small XML files (300 KB-500 KB). I'm using Mahout's XML input format in MapReduce to read the data, and I'm using a SAX parser for parsing, but processing is very slow. Will compressing the input files (LZO) help improve performance? Each folder contains 80-90k XML files, and when I start the job it runs a mapper for each file. Is there any way to reduce the number of mappers?
Upvotes: 0
Views: 1664
Reputation: 38950
I have found article 1 and article 2, which list several approaches to the small-files problem (I have removed some non-generic alternatives from these articles). You can follow one of them, quoted below:
CombineFileInputFormat: The CombineFileInputFormat is an abstract class provided by Hadoop that merges small files at MapReduce read time. The merged files are not persisted to disk. Instead, the process reads multiple files and merges them “on the fly” for consumption by a single map task.
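Here is a minimal sketch of what such a subclass could look like. The WholeFileRecordReader nested class is my own illustrative reader (it is not part of Hadoop or the quoted article), and the 128 MB split cap is just an example value; only the CombineFileInputFormat / CombineFileRecordReader wiring is standard Hadoop API:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

/** Packs many small XML files into each split so one mapper reads many files. */
public class CombineXmlInputFormat extends CombineFileInputFormat<Text, BytesWritable> {

    public CombineXmlInputFormat() {
        // Cap each combined split at ~128 MB, i.e. a few hundred 300-500 KB files.
        setMaxSplitSize(128L * 1024 * 1024);
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // keep each XML document whole
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader hands each file of the combined split to a per-file reader.
        return new CombineFileRecordReader<Text, BytesWritable>(
                (CombineFileSplit) split, context, WholeFileRecordReader.class);
    }

    /** Hypothetical per-file reader: emits (file path, file bytes) once per file. */
    public static class WholeFileRecordReader extends RecordReader<Text, BytesWritable> {
        private final Path path;
        private final long length;
        private final TaskAttemptContext context;
        private final Text key = new Text();
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        // Constructor signature required by CombineFileRecordReader.
        public WholeFileRecordReader(CombineFileSplit split, TaskAttemptContext context,
                                     Integer index) {
            this.path = split.getPath(index);
            this.length = split.getLength(index);
            this.context = context;
        }

        @Override public void initialize(InputSplit split, TaskAttemptContext ctx) { }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            // Read the whole XML file into memory; it is only a few hundred KB.
            byte[] contents = new byte[(int) length];
            FileSystem fs = path.getFileSystem(context.getConfiguration());
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.readFully(in, contents, 0, contents.length);
            }
            key.set(path.toString());
            value.set(contents, 0, contents.length);
            processed = true;
            return true;
        }

        @Override public Text getCurrentKey() { return key; }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() { }
    }
}
```

Your mapper would then receive one whole XML document per record and could feed its bytes to the SAX parser, while each map task works through hundreds of files instead of one.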
Upvotes: 1
Reputation: 1496
Hadoop doesn't work very well with a huge number of small files. It was designed to deal with a few very big files.
Compressing your files won't help because, as you have noticed, the problem is that your job has to instantiate a lot of containers to execute the maps (one for each file). Instantiating containers can take more time than processing the input itself (and a lot of resources, like memory and CPU).
I'm not familiar with Mahout's input formats, but Hadoop has a class that minimizes that problem by combining several inputs into one mapper: CombineTextInputFormat. To work with XML you may need to create your own XMLInputFormat extending CombineFileInputFormat.
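A minimal driver sketch of that wiring, using CombineTextInputFormat as named above (the job name, paths and the 128 MB cap are illustrative; for whole XML documents you would plug in your own CombineFileInputFormat subclass instead):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "combine-small-xml");
        job.setJarByClass(CombineDriver.class);

        // Pack many small files into each split instead of one mapper per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap the combined split size (here ~128 MB, an illustrative value).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Mapper/reducer classes omitted; set them as in your existing job.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

With this, the number of map tasks is driven by the total input size divided by the split cap, not by the number of files.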
Another alternative, though with less improvement, would be to reuse the JVM among the containers: reuse JVM in Hadoop mapreduce jobs
Reusing the JVM saves the time required to create each JVM, but you still need one container for each file.
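A minimal sketch of that setting, assuming the classic MRv1 runtime (the property is mapred.job.reuse.jvm.num.tasks and -1 means unlimited reuse; on YARN/MR2 it is reportedly ignored):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JvmReuseConfig {
    // Returns a Job whose map/reduce tasks reuse JVMs (classic MRv1 runtime only).
    public static Job newJobWithJvmReuse(String name) throws Exception {
        Configuration conf = new Configuration();
        // -1 = no limit: each task JVM is reused for many tasks of the same job,
        // saving JVM start-up time, but you still get one map task per file
        // unless the input format also combines files.
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
        return Job.getInstance(conf, name);
    }
}
```

Equivalently, you can set the same property in mapred-site.xml, or pass -Dmapred.job.reuse.jvm.num.tasks=-1 on the command line if your driver uses ToolRunner.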
Upvotes: 2