Reputation: 96836
I have thousands of files to process. Each file consists of thousands of XML files concatenated together.
I'd like to use Hadoop to split out each XML file so it can be processed separately. What would be a good way of doing this with Hadoop?
NOTES: I am a total Hadoop newbie. I plan on using Amazon EMR.
Upvotes: 1
Views: 687
Reputation: 39933
Check out Mahout's XmlInputFormat. It's a shame that this is in Mahout and not in the core distribution.
Are the concatenated XML files at least in the same format? If so, set START_TAG_KEY and END_TAG_KEY to the opening and closing tags of the root element in your files. Each concatenated XML file will then show up as one Text record in the map. Then you can use your favorite Java XML parser to finish the job.
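To make the idea concrete, here is a rough plain-Java sketch of what happens to each record. It is not Mahout's code and does not use Hadoop at all: the class XmlRecordSplitter, its methods, and the doc/title element names are made up for illustration. It cuts a concatenated string into records between a start tag and an end tag (roughly what XmlInputFormat hands to the mapper as one Text value), then parses one record with the JDK's built-in DOM parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only (not part of Mahout or Hadoop).
public class XmlRecordSplitter {

    // Returns every substring that runs from startTag through the next endTag.
    // Anything between records (e.g. repeated <?xml ...?> declarations) is skipped.
    // Assumes the root element is never nested inside itself.
    static List<String> splitRecords(String data, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = data.indexOf(startTag, pos);
            if (start < 0) break;
            int end = data.indexOf(endTag, start + startTag.length());
            if (end < 0) break; // unterminated record: stop
            end += endTag.length();
            records.add(data.substring(start, end));
            pos = end;
        }
        return records;
    }

    // Parses one record (what a mapper would receive) and reads the text of
    // the first element with the given name.
    static String firstText(String recordXml, String elementName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(recordXml)));
        return doc.getElementsByTagName(elementName).item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Two XML files concatenated together, each with its own declaration.
        String concatenated =
                "<?xml version=\"1.0\"?><doc><title>first</title></doc>\n"
              + "<?xml version=\"1.0\"?><doc><title>second</title></doc>\n";

        for (String record : splitRecords(concatenated, "<doc>", "</doc>")) {
            System.out.println(firstText(record, "title"));
        }
    }
}
```

In a real job the splitting is done for you by the input format, so your mapper only needs the parsing half. Note that this sketch matches the tags as exact strings, so a root tag with attributes would need a different start tag.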
Upvotes: 3