Chhaya Vishwakarma

Reputation: 1437

Processing XML with Hadoop MapReduce

I want to load and parse several petabytes of XML data. After doing a lot of research on how to process XML in Hadoop, I have come to understand that XML has to be processed as a whole file in MapReduce.

If I feed the whole XML as a single input split to my MapReduce job, it will not make use of Hadoop's distributed and parallel processing, because only one mapper will do the work.

Have I understood this correctly? How can I overcome this problem?

Please suggest.

Upvotes: 2

Views: 7309

Answers (2)

vy32

Reputation: 29687

If you have a single block of XML data that is a petabyte in size, you have a problem. More likely you have millions or billions of individual XML records. If that is the case, there is a fairly straightforward approach: create millions of XML files, each roughly the same size as (a little smaller than) the block size of your HDFS system. Then write a set of MapReduce jobs where the first mapper extracts the XML data and emits whatever (name, value) pairs are useful, and the reducer collects all of the (name) pairs from the various XML files that require correlation.
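
For what it's worth, here is a minimal sketch of what such a job could look like. It assumes each call to the mapper receives one XML record as its value (how records reach the mapper depends on the input format you pair it with, e.g. a custom whole-file format or the XMLInputFormat mentioned in the other answer), and it uses a naive regex as a stand-in for a real XML parser:

```java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class XmlAggregateJob {

    // Mapper: each call receives one XML record as its value and emits
    // the (name, value) pairs extracted from it.
    public static class XmlMapper extends Mapper<Object, Text, Text, Text> {
        // Naive stand-in for real XML parsing: pulls flat <tag>value</tag> pairs.
        private static final Pattern PAIR = Pattern.compile("<(\\w+)>([^<]*)</\\1>");

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            Matcher m = PAIR.matcher(value.toString());
            while (m.find()) {
                context.write(new Text(m.group(1)), new Text(m.group(2)));
            }
        }
    }

    // Reducer: gathers every value seen for the same name across all files.
    public static class CollectReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder joined = new StringBuilder();
            for (Text v : values) {
                joined.append(v.toString()).append(';');
            }
            context.write(key, new Text(joined.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "xml aggregate");
        job.setJarByClass(XmlAggregateJob.class);
        job.setMapperClass(XmlMapper.class);
        job.setReducerClass(CollectReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```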

If the XML dataset is changing over time you may wish to look at support for streaming datasets.

Upvotes: 0

Ashrith

Reputation: 6855

You could try Mahout's XMLInputFormat. It takes care of figuring out the record boundaries within your XML input files, using the start and end tags you specify.

You could use this link as a reference on how to use XMLInputFormat to parse your XML files.
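
For illustration, here is a minimal driver sketch. It assumes the records are delimited by a hypothetical <record>...</record> element, and note that the package of XmlInputFormat has moved between Mahout releases, so adjust the import to your version:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// Package path varies across Mahout versions; adjust to the one you use.
import org.apache.mahout.text.wikipedia.XmlInputFormat;

public class XmlParseDriver {

    // Minimal mapper: with XmlInputFormat each value is one whole
    // <record>...</record> block; real parsing would replace the echo below.
    public static class RecordMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text("record"), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Start/end tags that delimit one record; XmlInputFormat reads these keys.
        conf.set("xmlinput.start", "<record>");
        conf.set("xmlinput.end", "</record>");

        Job job = Job.getInstance(conf, "xml parse");
        job.setJarByClass(XmlParseDriver.class);
        job.setInputFormatClass(XmlInputFormat.class);
        job.setMapperClass(RecordMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```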

Upvotes: 2
