Aryan087

Reputation: 526

Parsing millions of small XML files

I have 10 million small XML files (300KB-500KB). I'm using Mahout's XML input format in MapReduce to read the data and a SAX parser for parsing, but processing is very slow. Will compressing the input files (LZO) help improve performance? Each folder contains 80-90k XML files, and when I start the job it runs one mapper per file. Is there any way to reduce the number of mappers?

Upvotes: 0

Views: 1664

Answers (2)

Ravindra babu

Reputation: 38950

You can follow one of the three approaches as quoted in this article:

  1. Hadoop Archive File (HAR)
  2. Sequence Files
  3. HBase

I have found article 1 and article 2, which list multiple solutions (I have removed some non-generic alternatives from these articles):

  1. Change the ingestion process/interval: Change the logic at the source level to reduce the number of small files and try to generate a smaller number of big files instead
  2. Batch file consolidation: When small files are unavoidable, file consolidation is the most common solution. With this option you periodically run a simple consolidating MapReduce job that reads all of the small files in a folder and rewrites them into fewer, larger files
  3. Sequence files: When there is a requirement to maintain the original filename, a very common approach is to use Sequence files. In this solution, the filename is stored as the key in the sequence file and the file contents are stored as the value (see the sketch after this list)
  4. HBase: Instead of writing the files to disk, write them to the HBase memory store.
  5. Using a CombineFileInputFormat: The CombineFileInputFormat is an abstract class provided by Hadoop that merges small files at MapReduce read time. The merged files are not persisted to disk. Instead, the process reads multiple files and merges them “on the fly” for consumption by a single map task.
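
To illustrate option 3, here is a minimal sketch of a consolidation step that packs one folder of small XML files into a single SequenceFile, keyed by the original filename. It assumes Hadoop 2.x APIs; the class name and the input/output paths are only placeholders.

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class XmlToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path inputDir = new Path(args[0]);   // folder containing the small XML files
        Path outputFile = new Path(args[1]); // consolidated sequence file to create

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(outputFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {

            // key = original filename, value = raw XML bytes
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isFile()) {
                    byte[] content = readFully(fs.open(status.getPath()));
                    writer.append(new Text(status.getPath().getName()),
                                  new BytesWritable(content));
                }
            }
        }
    }

    private static byte[] readFully(InputStream in) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        IOUtils.copyBytes(in, out, 4096, true); // also closes the input stream
        return out.toByteArray();
    }
}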

Upvotes: 1

RojoSam

Reputation: 1496

Hadoop doesn't work very well with a huge number of small files. It was designed to deal with a few very big files.

Compressing your files won't help, because, as you have noticed, the problem is that your job has to instantiate a lot of containers to execute the maps (one for each file). Instantiating the containers can take more time than processing the input itself (and a lot of resources, like memory and CPU).

I'm not familiar with Mahout's input formats, but Hadoop has a class that minimizes this problem by combining several input files into one mapper: CombineTextInputFormat. To work with XML you may need to create your own XMLInputFormat extending CombineFileInputFormat.
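
As a rough sketch of that idea (the class names below are illustrative, not Mahout's; it assumes each small XML file fits in memory and should reach the mapper as a single record, with the file path as key and the raw bytes as value):

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class CombineXmlInputFormat extends CombineFileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split an individual XML file
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader iterates over the files in the combined split
        // and delegates each one to a fresh WholeXmlRecordReader.
        return new CombineFileRecordReader<Text, BytesWritable>(
                (CombineFileSplit) split, context, WholeXmlRecordReader.class);
    }

    /** Reads one whole XML file from a CombineFileSplit as a single record. */
    public static class WholeXmlRecordReader extends RecordReader<Text, BytesWritable> {
        private final Path path;
        private final long length;
        private final FileSystem fs;
        private boolean processed = false;
        private final Text key = new Text();
        private final BytesWritable value = new BytesWritable();

        public WholeXmlRecordReader(CombineFileSplit split, TaskAttemptContext context,
                                    Integer index) throws IOException {
            this.path = split.getPath(index);
            this.length = split.getLength(index);
            this.fs = path.getFileSystem(context.getConfiguration());
        }

        @Override public void initialize(InputSplit split, TaskAttemptContext context) { }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) return false;
            byte[] contents = new byte[(int) length];
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.readFully(in, contents, 0, contents.length);
            }
            key.set(path.toString());
            value.set(contents, 0, contents.length);
            processed = true;
            return true;
        }

        @Override public Text getCurrentKey() { return key; }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() { }
    }
}
```

In the driver you would then set `job.setInputFormatClass(CombineXmlInputFormat.class)` and a maximum split size (for example `FileInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024)`) so that many small files are grouped into each split, and therefore into each mapper.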

Another alternative, with less improvement, is to reuse the JVM among the containers: reuse JVM in Hadoop mapreduce jobs

Reusing the JVM saves the time required to create each JVM, but you still need to create one container for each file.
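
As a minimal sketch of how JVM reuse is typically enabled (the property name below is the Hadoop 2 one; in classic MapReduce it was called mapred.job.reuse.jvm.num.tasks, and on YARN each map still gets its own container):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JvmReuseDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // -1 = reuse the same JVM for an unlimited number of tasks of this job
        conf.setInt("mapreduce.job.jvm.numtasks", -1);

        Job job = Job.getInstance(conf, "parse-small-xml");
        // ... configure mapper, input format and input/output paths as usual
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```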

Upvotes: 2
