jldupont

Reputation: 96836

Hadoop job to split XML files

I've got thousands of files to process. Each file consists of thousands of XML documents concatenated together.

I'd like to use Hadoop to separate out each individual XML document. What would be a good way of doing this using Hadoop?

NOTES: I am a total Hadoop newbie. I plan on using Amazon EMR.

Upvotes: 1

Views: 687

Answers (1)

Donald Miner

Reputation: 39933

Check out Mahout's XmlInputFormat. It's a shame that this is in Mahout and not in the core distribution.

Are the concatenated XML documents at least all in the same format? If so, set START_TAG_KEY and END_TAG_KEY in the job configuration to the opening and closing tags of the root element your documents share. Each document will then show up as one Text record in the map. From there, you can use your favorite Java XML parser to finish the job.
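To make the idea concrete, here is a minimal plain-Java sketch of the same start-tag/end-tag scan, followed by a parse of each record with the JDK's DOM parser. It is not Mahout's code and does not use Hadoop; the class name `XmlSplitSketch`, the helpers `splitRecords` and `idOf`, the `<doc>` root element, and its `id` attribute are all made up for illustration. It assumes the root element never nests inside itself and that nothing you need lives between documents (anything there, such as an XML declaration, is skipped).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical illustration only: mimics the "scan from start tag to end tag"
// idea in plain Java, without Hadoop or Mahout on the classpath.
class XmlSplitSketch {

    // Returns every span that begins at startTag and ends at the next endTag
    // (both tags included). Text between records is ignored.
    static List<String> splitRecords(String input, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = input.indexOf(startTag, pos);
            if (start < 0) {
                break; // no more records
            }
            int end = input.indexOf(endTag, start + startTag.length());
            if (end < 0) {
                break; // unterminated trailing record: dropped in this sketch
            }
            end += endTag.length();
            records.add(input.substring(start, end));
            pos = end;
        }
        return records;
    }

    // Parses one record with the JDK DOM parser and returns the root element's
    // "id" attribute (an assumed attribute; empty string if it is absent).
    static String idOf(String xmlRecord) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xmlRecord.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getAttribute("id");
        } catch (Exception e) {
            throw new IllegalArgumentException("Record is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String concatenated =
            "<?xml version=\"1.0\"?><doc id=\"1\"><t>a</t></doc>\n"
          + "<?xml version=\"1.0\"?><doc id=\"2\"><t>b</t></doc>";
        // "<doc" (no closing bracket) lets the start tag carry attributes,
        // but it would also match an element such as <document>.
        for (String record : splitRecords(concatenated, "<doc", "</doc>")) {
            System.out.println(idOf(record) + " -> " + record);
        }
    }
}
```

In a real job the input format does the scanning for you, so the body of your mapper would look roughly like `idOf`: take the Text value, parse it, and emit whatever you need.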

Upvotes: 3
