Reputation: 955
Hi, I have installed hadoop-0.20.2-cdh3u5 in pseudo-distributed mode on VMware. I want to parse an XML file in this environment. I know the general workflow: write map/reduce code, export it as a .jar, and execute it on the cluster. What I cannot figure out is how to fit my Java parsing code (a JAXB unmarshaller, shown below) into map/reduce classes and then produce CSV files as output.
So I have this parsing code:
import java.io.FileNotFoundException;
import java.io.FileReader;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import javax.xml.bind.Unmarshaller;

public class JAXBC {
    private JAXBContext context;
    private Unmarshaller um;

    public JAXBC() throws JAXBException
    {
        // create the JAXB context (assigned to the field, not a
        // shadowing local variable) and an Unmarshaller for it
        context = JAXBContext.newInstance(ConnectHome.class);
        um = context.createUnmarshaller();
    }

    public ConnectHome convertJAXB(String strFilePath)
            throws FileNotFoundException, JAXBException
    {
        // unmarshal the XML file into a ConnectHome object tree
        return (ConnectHome) um.unmarshal(new FileReader(strFilePath));
    }
}
I have XML something like this (one sample element):
<Course>
    <ID>1001</ID>
    <Seats>10</Seats>
    <Description>Department: CS , Faculty: XYZ</Description>
    <Faculty>
        <Name>XYZ</Name>
        <Age>30</Age>
    </Faculty>
</Course>
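For illustration only, here is what the desired transformation amounts to for the sample element above, written as plain JDK DOM code with no Hadoop involved. The class name CourseToCsv and the column order ID,Seats,Name,Age are my assumptions, not part of the question:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class CourseToCsv {
    // text content of the first element with the given tag name
    static String text(Document doc, String tag) {
        return doc.getElementsByTagName(tag).item(0).getTextContent();
    }

    // turns one <Course> element into one CSV row
    public static String toCsvLine(String courseXml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(
                courseXml.getBytes(StandardCharsets.UTF_8)));
        // column order is an assumption: ID,Seats,Name,Age
        return String.join(",",
                text(doc, "ID"), text(doc, "Seats"),
                text(doc, "Name"), text(doc, "Age"));
    }

    public static void main(String[] args) throws Exception {
        String sample = "<Course><ID>1001</ID><Seats>10</Seats>"
                + "<Description>Department: CS , Faculty: XYZ</Description>"
                + "<Faculty><Name>XYZ</Name><Age>30</Age></Faculty></Course>";
        System.out.println(toCsvLine(sample));  // 1001,10,XYZ,30
    }
}
```

In a MapReduce job this per-record conversion is exactly what would live inside the mapper; the rest is wiring.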
Now my problem is that I cannot figure out how to write this particular piece of code in map/reduce form. I have referred to a Hadoop tutorial and various tutorials on Yahoo.
So my question is: can someone show me how to write such a map/reduce job and then build a jar file from it?
Let me know if other information is needed; I have tried to keep this as short as I can.
Thanks in advance.
Note: I know this sounds like a very trivial question in the MapReduce world, and the XML shown here is just an example of a single tag with a few tags inside it.
Upvotes: 1
Views: 7394
Reputation: 976
For XML you generally want to convert into a structured serialization format such as Avro and process from there. The Hadoop ecosystem grew up around processing unstructured data and transforming it into structured data in HDFS, so intake and processing of already-structured data is not yet an intuitive part of the ecosystem. Mahout has some code for XML intake in its Bayes package that works much like Sree's answer.
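As a sketch of what that conversion target could look like, the Course element from the question might map to an Avro record schema along these lines (field names and types are assumptions read off the sample XML):

```json
{
  "type": "record",
  "name": "Course",
  "fields": [
    {"name": "ID", "type": "int"},
    {"name": "Seats", "type": "int"},
    {"name": "Description", "type": "string"},
    {"name": "Faculty", "type": {
      "type": "record",
      "name": "Faculty",
      "fields": [
        {"name": "Name", "type": "string"},
        {"name": "Age", "type": "int"}
      ]
    }}
  ]
}
```

Once the data is in Avro, downstream MapReduce jobs read typed records instead of re-parsing XML on every pass.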
Upvotes: 1
Reputation: 6139
Here is what you want: https://github.com/studhadoop/xmlparsing-hadoop/blob/master/XmlParser11.java
line 170: if (currentElement.equalsIgnoreCase("name"))
line 173: else if (currentElement.equalsIgnoreCase("value"))
name and value are the tags in my XML file. In your case, to process the tags inside Faculty, use Name instead of name and Age instead of value, and set the record delimiters:
conf.set("xmlinput.start", "<Faculty>");
conf.set("xmlinput.end", "</Faculty>");
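With xmlinput.start and xmlinput.end set like that, the input format hands each <Faculty>...</Faculty> chunk to the mapper as the value. The per-record logic could then look roughly like the helper below. This is a sketch only (the class and method names are assumptions); it is written as plain JDK code, with the Hadoop-specific wiring shown only in the comments:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class FacultyRecordParser {

    // Inside Mapper.map(), the value is one <Faculty>...</Faculty> record;
    // you would call this helper and then emit the row, e.g.:
    //   context.write(NullWritable.get(), new Text(toCsvLine(value.toString())));
    public static String toCsvLine(String facultyXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        facultyXml.getBytes(StandardCharsets.UTF_8)));
        String name = doc.getElementsByTagName("Name").item(0).getTextContent();
        String age = doc.getElementsByTagName("Age").item(0).getTextContent();
        return name + "," + age;  // one CSV row per record
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
            toCsvLine("<Faculty><Name>XYZ</Name><Age>30</Age></Faculty>"));
        // XYZ,30
    }
}
```

Packaging is the usual routine: compile against the Hadoop jars, build the jar (e.g. with jar -cf), and run it with hadoop jar on the cluster.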
Upvotes: 1