cstur4

Reputation: 1006

How can XML be parsed in Hadoop in parallel?

I have a large XML file, and I want to process it in parallel. 'Hadoop in Practice' uses Mahout's XMLInputFormat, and I find that the getSplits() method is not overridden. In other words, it uses TextInputFormat's getSplits() method. So how does this method avoid splitting the file in the middle of a record, between a begin tag and its end tag?

Say I have an XML file like the one below, and two mappers are used to process it in parallel.

 <persons>            
   <person>             
     <name>John</name>  
     <age>12</age>
   </person>

   . . . . . . .

   <person>
                      ------- until here as the first FileSplit
     <name>Amy</name>
     <age>14</age>
   </person>

   . . . . . . .

   <person>
     <name>Dan</name>
     <age>12</age>
   </person>
 </persons>          ------- remaining as the second FileSplit

When a mapper takes the second FileSplit, it can't recognize Amy's record, because the split doesn't contain her begin tag.

Upvotes: 1

Views: 287

Answers (2)

GiCo

Reputation: 596

Just to add the relevant point in the code.

The reader doesn't stop reading at the split's end position while withinBlock is true, so a record that begins before the boundary is read all the way to its end tag, even if that tag lies in the next split.

private boolean readUntilMatch(byte[] match, boolean withinBlock) throws IOException {
    int i = 0;
    while (true) {
        ...
        // see if we've passed the stop point:
        // (this check is skipped while withinBlock, so a record that
        // straddles the split boundary is still read to its closing tag)
        if (!withinBlock && i == 0 && fsin.getPos() >= end) {
            return false;
        }
    }
}

Upvotes: 0

Clément MATHIEU

Reputation: 3171

I'm not sure I understand the question.

XMLInputFormat does something very similar to TextInputFormat, but rather than splitting on end of line it uses xmlinput.start and xmlinput.end as delimiters. This class is very naïve and does not parse XML or anything complex; it only does dumb pattern matching to find the record boundaries.

The implementation is fairly straightforward, but you have to really understand what splits and records are.

A split is a part of a file, defined by a start and an end offset, that will be processed by a mapper. It does not need to be exactly aligned with records; it is a coarse-grained thing, and the RecordReader will take care of the exact offsets. For example, TextInputFormat computes the splits based on mapred.max.split.size. It does not actually read the file; it only does very simple maths based on this variable and the file size (it can be a bit more complex than that, because of compression for example, but you get the idea).
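
To make that "very simple maths" concrete, here is a minimal sketch, not Hadoop's actual code, with a hypothetical maxSplitSize parameter standing in for mapred.max.split.size. Note that the boundaries fall out of the file length alone; no data is read:

import java.util.ArrayList;
import java.util.List;

public class SplitMath {

    // A split is nothing more than a (start, length) pair over the file.
    static final class Split {
        final long start;
        final long length;
        Split(long start, long length) {
            this.start = start;
            this.length = length;
        }
        @Override
        public String toString() {
            return "[" + start + ", " + (start + length) + ")";
        }
    }

    static List<Split> computeSplits(long fileLength, long maxSplitSize) {
        List<Split> splits = new ArrayList<>();
        long offset = 0;
        while (offset < fileLength) {
            long length = Math.min(maxSplitSize, fileLength - offset);
            splits.add(new Split(offset, length));
            offset += length;
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 25 MB file with a 10 MB maximum: [0, 10M), [10M, 20M), [20M, 25M)
        for (Split s : computeSplits(25L << 20, 10L << 20)) {
            System.out.println(s);
        }
    }
}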

A record is the thing that will be passed as a <key, value> pair to your mapper. A RecordReader is in charge of extracting the records from a split. This is an easy task: for TextInputFormat it just looks for the next end-of-line characters, and XMLInputFormat does some very simple pattern matching.

The only issue to resolve is that the start offset of a split may not be aligned with the start of a record, and the same goes for the end of the split. This is very easily resolved by a simple algorithm: the record reader skips the bytes after the start offset until it finds the first record delimiter, and it keeps processing bytes past the end offset until a record delimiter is found.
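
Here is a hedged, self-contained sketch of that boundary rule. It is hypothetical code, not the real XMLInputFormat: it works on an in-memory string and assumes single-byte characters so byte and character offsets coincide:

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class NaiveXmlRecordReader {

    // Extract every record whose start tag begins inside [splitStart, splitEnd).
    // A record that begins before splitEnd is read past it to its end tag;
    // a record beginning at or after splitEnd is left to the next split.
    static List<String> readRecords(byte[] data, int splitStart, int splitEnd,
                                    String startTag, String endTag) {
        String text = new String(data, StandardCharsets.UTF_8);
        List<String> records = new ArrayList<>();
        int pos = splitStart;
        while (true) {
            // Skip forward to the next record delimiter: dumb pattern
            // matching, no real XML parsing.
            int start = text.indexOf(startTag, pos);
            if (start < 0 || start >= splitEnd) {
                break; // the next record (if any) belongs to the next split
            }
            int end = text.indexOf(endTag, start);
            if (end < 0) {
                break; // malformed tail: no closing tag
            }
            records.add(text.substring(start, end + endTag.length()));
            pos = end + endTag.length();
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] xml = ("<persons><person><name>John</name></person>"
                + "<person><name>Amy</name></person></persons>")
                .getBytes(StandardCharsets.UTF_8);
        int mid = xml.length / 2; // an arbitrary coarse-grained boundary
        System.out.println(readRecords(xml, 0, mid, "<person>", "</person>"));
        System.out.println(readRecords(xml, mid, xml.length, "<person>", "</person>"));
    }
}

Run on the <persons> document from the question with an arbitrary boundary in the middle, the first split emits John's record and leaves Amy's to the second split's reader, which is exactly why the second mapper does see her record.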

That's why you don't need to override getSplits() in XMLInputFormat. The coarse-grained split is exactly the same: "please split this file into 10 MB parts". The fine-grained split done by the RecordReader is: "please extract each <person></person> block from this split".

To configure XMLInputFormat you have to set the xmlinput.(start|end) properties in the configuration.
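
For example, a minimal job-setup sketch for the <person> example above. The XmlInputFormat class itself comes with the book's/Mahout's code, so its package and exact name vary by version; everything else here is standard Hadoop API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class XmlJobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The record delimiters XMLInputFormat will pattern-match on:
        conf.set("xmlinput.start", "<person>");
        conf.set("xmlinput.end", "</person>");

        Job job = Job.getInstance(conf, "parse-persons");
        // job.setInputFormatClass(XmlInputFormat.class); // from Mahout /
        // 'Hadoop in Practice'; package location depends on the version
        // ... set mapper class, input/output paths, etc., as usual
    }
}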

Upvotes: 2
