Reputation: 5460
I need to parse few XML's to TSV, the Size of the XML Files is of the order of 50 GB, I am basically doubtful about the implemetation i should choose to parse this i have two oprions
i have a fair bit of idea about SAXParser implementaion but i think having access to Hadoop cluster, i should use Hadoop as this is what hadoop is for i.e. Big Data
it would be great someone could provide a hint/doc as how to do this in Hadoop or efficient SAXParser implementaion for such a big file or rather what should i go for Hadoop or SAXparser?
Upvotes: 0
Views: 2071
Reputation: 3377
I think that SAX has traditionally been mistakenly associated with processing big XML files... in reality, VTD-XML is often the best option, far better than SAX in terms of performance, flexibility, code readability and maintainability... on the issue of memory, VTD-XML's in-memory model is only 1.3x~1.5X the size of the corresponding XML document.
VTD-XML has another significant benefit over SAX: its unparalleled XPath support. Because of it, VTD-XML users routinely report performance gain of 10 to 60x over SAX parsing over hundreds of MB XML files.
http://www.infoq.com/articles/HIgh-Performance-Parsers-in-Java#anch104307
Read this paper that comprehensively compares the existing XML parsing frameworks in Java.
http://sdiwc.us/digitlib/journal_paper.php?paper=00000582.pdf
Upvotes: 0
Reputation: 36
I process large XML files in Hadoop quite regularly. I found it to be the best way (not the only way... the other is to write SAX code) since you can still operate on the records in a dom-like fashion.
With these large files, one thing to keep in mind is that you'll most definitely want to enable compression on the mapper output: Hadoop, how to compress mapper output but not the reducer output... this will speed things up quite a bit.
I've written a quick outline of how I've handled all this, maybe it'll help: http://davidvhill.com/article/processing-xml-with-hadoop-streaming. I use Python and Etrees which makes things really simple....
Upvotes: 2
Reputation: 8088
It is rilatively trivial to process XML on hadoop by having one mapper per XML file. This approach will be fine for large number of relatively small XMLs
The problem is that in Your case files are big and thier number is small so without splitting hadoop benefit will be limited. Taking to account hadoop's overhead the benefit be negative...
In hadoop we need to be able to split input files into logical parts (called splits) to efficiently process large files.
In general XML is not looks like "spliitable" format since there is no well defined division into blocks, which can be processed independently. In the same time, if XML contains "records" of some kind splitting can be implemented.
Good discussion about splitting XMLs in haoop is here:
http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html
where Mahout's XML input format is suggested.
Regarding your case - I think as long as number of your files is not much bigger then number of cores you have on single system - hadoop will not be efficient solution.
In the same time - if you want to accumulate them over time - you can profit from hadoop as a scalable storage also.
Upvotes: 0
Reputation: 546
I don't know about SAXparser. But definitely Hadoop will do your job if you have a hadoop cluster with enough data nodes. 50Gb is nothing as I was performing operations on more than 300GB of data on my cluster. Write a map reduce job in java and the documentation for hadoop can be found at http://hadoop.apache.org/
Upvotes: 0