Aryan087

Reputation: 526

Parsing millions of small XML files

I have 10 million small XML files (300KB-500KB). I'm using Mahout's XML input format in MapReduce to read the data and a SAX parser for parsing, but processing is very slow. Will compressing the input files (LZO) help improve performance? Each folder contains 80-90k XML files, and when I start the job it runs one mapper per file. Is there any way to reduce the number of mappers?

Upvotes: 0

Views: 1664

Answers (2)

Ravindra babu

Reputation: 38950

You can follow one of the three approaches as quoted in this article:

  1. Hadoop Archive File (HAR)
  2. Sequence Files
  3. HBase

I have found article 1 and article 2, which list multiple solutions (I have removed some non-generic alternatives from these articles):

  1. Change the ingestion process/interval: Change the logic at the source level to reduce the number of small files and try to generate a smaller number of big files instead
  2. Batch file consolidation: When small files are unavoidable, file consolidation is the most common solution. With this option you periodically run a simple consolidating MapReduce job that reads all of the small files in a folder and rewrites them into fewer, larger files
  3. Sequence files: When there is a requirement to maintain the original filename, a very common approach is to use Sequence files. In this solution, the filename is stored as the key in the sequence file and the file contents are stored as the value (see the sketch after this list)
  4. HBase: Instead of writing the files to disk, write them to the HBase memory store.
  5. Using a CombineFileInputFormat: The CombineFileInputFormat is an abstract class provided by Hadoop that merges small files at MapReduce read time. The merged files are not persisted to disk. Instead, the process reads multiple files and merges them “on the fly” for consumption by a single map task.
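
To illustrate option 3, here is a minimal sketch of a consolidation step that packs one folder of small XML files into a single SequenceFile, keyed by the original filename. It assumes Hadoop 2.x APIs; the class name and the input/output paths are only placeholders.

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class XmlToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputDir = new Path(args[0]);   // folder containing the small XML files
        Path outputFile = new Path(args[1]); // consolidated sequence file to create

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(outputFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {

            // key = original filename, value = raw XML bytes
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isFile()) {
                    byte[] content = readFully(fs.open(status.getPath()));
                    writer.append(new Text(status.getPath().getName()),
                                  new BytesWritable(content));
                }
            }
        }
    }

    private static byte[] readFully(InputStream in) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        IOUtils.copyBytes(in, out, 4096, true); // also closes the input stream
        return out.toByteArray();
    }
}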

Upvotes: 1

RojoSam

Reputation: 1496

Hadoop doesn't work very well with a huge number of small files. It was designed to deal with a few very big files.

Compressing your files won't help, because, as you have noticed, the problem is that your job has to instantiate a lot of containers to execute the maps (one for each file). Instantiating the containers can take more time than processing the input itself (and a lot of resources, like memory and CPU).

I'm not familiar with Mahout's input formats, but Hadoop has a class that minimizes this problem by combining several input files into one mapper: CombineTextInputFormat. To work with XML you may need to create your own XMLInputFormat extending CombineFileInputFormat.
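
As a rough sketch of that idea (the class names below are illustrative, not Mahout's; it assumes each small XML file fits in memory and should reach the mapper as a single record, with the file path as key and the raw bytes as value):

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class CombineXmlInputFormat extends CombineFileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split an individual XML file
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader iterates over the files in the combined split
        // and delegates each one to a fresh WholeXmlRecordReader.
        return new CombineFileRecordReader<Text, BytesWritable>(
                (CombineFileSplit) split, context, WholeXmlRecordReader.class);
    }

    /** Reads one whole XML file from a CombineFileSplit as a single record. */
    public static class WholeXmlRecordReader extends RecordReader<Text, BytesWritable> {
        private final Path path;
        private final long length;
        private final FileSystem fs;
        private boolean processed = false;
        private final Text key = new Text();
        private final BytesWritable value = new BytesWritable();

        public WholeXmlRecordReader(CombineFileSplit split, TaskAttemptContext context,
                                    Integer index) throws IOException {
            this.path = split.getPath(index);
            this.length = split.getLength(index);
            this.fs = path.getFileSystem(context.getConfiguration());
        }

        @Override public void initialize(InputSplit split, TaskAttemptContext context) { }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) return false;
            byte[] contents = new byte[(int) length];
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.readFully(in, contents, 0, contents.length);
            }
            key.set(path.toString());
            value.set(contents, 0, contents.length);
            processed = true;
            return true;
        }

        @Override public Text getCurrentKey() { return key; }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() { }
    }
}
```

In the driver you would then set `job.setInputFormatClass(CombineXmlInputFormat.class)` and a maximum split size (for example `FileInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024)`) so that many small files are grouped into each split, and therefore into each mapper.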

Another alternative, with less improvement, is to reuse the JVM among the containers: reuse JVM in Hadoop mapreduce jobs

Reusing the JVM saves the time required to create each JVM, but you still need to create one container for each file.
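
As a minimal sketch of how JVM reuse is typically enabled (the property name below is the Hadoop 2 one; in classic MapReduce it was called mapred.job.reuse.jvm.num.tasks, and on YARN each map still gets its own container):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JvmReuseDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // -1 = reuse the same JVM for an unlimited number of tasks of this job
        conf.setInt("mapreduce.job.jvm.numtasks", -1);

        Job job = Job.getInstance(conf, "parse-small-xml");
        // ... configure mapper, input format and input/output paths as usual
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```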

Upvotes: 2
