Reputation: 181
I am trying to explore Apache Spark, and as part of that I wanted to customize the InputFormat. In my case I want to read an XML
file and convert every occurrence of <text>
into a new record.
I wrote a customized TextInputFormat
(XMLRecordInputFormat.java) that returns a customized **XMLRecordReader extends org.apache.hadoop.mapreduce.RecordReader**.
But I don't understand why the Spark master does not invoke the customized input format (XMLRecordInputFormat.class).
For some reason it continues to behave like the normal line splitter.
Following is the code:
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;
public class CustomizedXMLReader {

    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf()
                .setMaster("local")
                .setAppName("CustomizedXMLReader")
                .set("spark.executor.memory", "512m")
                .set("record.delimiter.regex", "</bermudaview>");

        JobConf jobConf = new JobConf(new Configuration(), CustomizedXMLReader.class);
        jobConf.setInputFormat(XMLRecordInputFormat.class);
        FileInputFormat.setInputPaths(jobConf, new Path(args[0]));

        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        JavaPairRDD<LongWritable, Text> lines = ctx.hadoopRDD(jobConf, XMLRecordInputFormat.class, LongWritable.class, Text.class);

        Function<Tuple2<LongWritable, Text>, XMLRecord> keyData =
                new Function<Tuple2<LongWritable, Text>, XMLRecord>() {
                    @Override
                    public XMLRecord call(Tuple2<LongWritable, Text> arg0) throws Exception {
                        System.out.println(arg0.toString());
                        XMLRecord record = new XMLRecord();
                        record.setPos(Long.getLong(arg0._1.toString()));
                        record.setXml(arg0._2.toString());
                        return record;
                    }
                };

        JavaRDD<XMLRecord> words = lines.map(keyData);
        List<XMLRecord> tupleList = words.collect();
        Iterator<XMLRecord> itr = tupleList.iterator();
        while (itr.hasNext()) {
            XMLRecord t = itr.next();
            System.out.println(t.getXml());
            System.out.println(t.getPos());
        }
    }
}
// Following is the custom InputFormat implementation
public class XMLRecordInputFormat extends TextInputFormat {

    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit arg0, JobConf arg1, Reporter arg2) throws IOException {
        XMLRecordReader r = new XMLRecordReader();
        return r;
    }
}
Upvotes: 2
Views: 800
Reputation: 181
I think I figured out the way to do it.
I was getting confused between the two APIs: org.apache.hadoop.mapred.RecordReader (an interface) and org.apache.hadoop.mapreduce.RecordReader (a class), and also which InputFormat to use with each.
It looks like org.apache.hadoop.mapred.FileInputFormat and org.apache.hadoop.mapred.RecordReader go hand in hand. Please find the complete code for parsing XML into a JavaRDD below.
In this example I am looking to parse and extract the contents of an XML tag (the <name>...</name> element configured in the record reader below).
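For anyone who hits the same confusion: in the Java API, JavaSparkContext.hadoopRDD / hadoopFile work with the old org.apache.hadoop.mapred.InputFormat (whose reader comes from getRecordReader), while newAPIHadoopFile / newAPIHadoopRDD work with the new org.apache.hadoop.mapreduce.InputFormat (createRecordReader). Mixing the two seems to be why a custom format can silently fall back to plain line splitting. Below is a minimal sketch of the two pairings; ApiPairingSketch is just an illustrative name, and XMLRecordFileInputFormat is the input format shown further down.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ApiPairingSketch {
    public static void main(String[] args) {
        JavaSparkContext ctx = new JavaSparkContext(
                new SparkConf().setMaster("local").setAppName("ApiPairingSketch"));

        // Old "mapred" API: the InputFormat exposes getRecordReader(...)
        // and is consumed by hadoopRDD / hadoopFile.
        JobConf jobConf = new JobConf();
        org.apache.hadoop.mapred.FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
        JavaPairRDD<Text, Text> oldApi = ctx.hadoopRDD(
                jobConf, XMLRecordFileInputFormat.class, Text.class, Text.class);

        // New "mapreduce" API: the InputFormat exposes createRecordReader(...)
        // and is consumed by newAPIHadoopFile / newAPIHadoopRDD.
        JavaPairRDD<LongWritable, Text> newApi = ctx.newAPIHadoopFile(
                args[0],
                org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class,
                LongWritable.class, Text.class,
                new Configuration());

        System.out.println(oldApi.count() + " records (old API) / " + newApi.count() + " lines (new API)");
    }
}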
Main class
import java.io.Serializable;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;
public class CustomizedXMLReader implements Serializable {

    private static final long serialVersionUID = 1L;

    public static void main(String[] args) {
        CustomizedXMLReader reader = new CustomizedXMLReader();
        reader.readUsingFileInputFormat(args);
    }

    /**
     * Does all reading through the org.apache.hadoop.mapred.RecordReader interface. This works well.
     * @param args
     */
    public void readUsingFileInputFormat(String[] args) {
        SparkConf sparkConf = new SparkConf()
                .setMaster("local")
                .setAppName("CustomizedXMLReader")
                .set("spark.executor.memory", "512m")
                .set("record.delimiter.regex", "</name>");

        JobConf jobConf = new JobConf(new Configuration(), CustomizedXMLReader.class);
        jobConf.setInputFormat(XMLRecordFileInputFormat.class);
        FileInputFormat.setInputPaths(jobConf, new Path(args[0]));

        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        JavaPairRDD<Text, Text> lines = ctx.hadoopRDD(jobConf, XMLRecordFileInputFormat.class, Text.class, Text.class);

        Function<Tuple2<Text, Text>, XMLRecord> keyData =
                new Function<Tuple2<Text, Text>, XMLRecord>() {
                    private static final long serialVersionUID = 1L;

                    @Override
                    public XMLRecord call(Tuple2<Text, Text> arg0) throws Exception {
                        System.out.println(arg0.toString());
                        XMLRecord record = new XMLRecord();
                        record.setPos(arg0._1.toString());
                        record.setXml(arg0._2.toString());
                        return record;
                    }
                };

        JavaRDD<XMLRecord> words = lines.map(keyData);
        List<XMLRecord> tupleList = words.collect();
        Iterator<XMLRecord> itr = tupleList.iterator();
        while (itr.hasNext()) {
            XMLRecord t = itr.next();
            System.out.println(t.getXml());
            System.out.println(t.getPos());
        }
    }
}
RecordReader
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.streaming.StreamXmlRecordReader;
public class XMLInterfaceRecordReader implements org.apache.hadoop.mapred.RecordReader<Text, Text> {

    private StreamXmlRecordReader in;
    private String delimiterRegex;
    private long start;
    private long pos;
    private long end;
    private static Long keyInt = 0L;

    public XMLInterfaceRecordReader(InputSplit split, JobConf arg1, Reporter rep) throws IOException {
        super();
        FileSplit fSplit = (FileSplit) split;
        this.delimiterRegex = "</name>";
        start = fSplit.getStart();
        end = start + fSplit.getLength();
        // Tell StreamXmlRecordReader which tags delimit one record.
        arg1.set("stream.recordreader.begin", "<name>");
        arg1.set("stream.recordreader.end", delimiterRegex);
        final Path file = fSplit.getPath();
        FileSystem fs = file.getFileSystem(arg1);
        FSDataInputStream fileIn = fs.open(fSplit.getPath());
        boolean skipFirstLine = false;
        if (start != 0) {
            skipFirstLine = true;
            --start;
            fileIn.seek(start);
        }
        in = new StreamXmlRecordReader(fileIn, fSplit, rep, arg1, fs);
        this.pos = start;
    }

    @Override
    public void close() throws IOException {
        if (in != null) {
            in.close();
        }
    }

    @Override
    public Text createKey() {
        return new Text();
    }

    @Override
    public Text createValue() {
        return new Text();
    }

    @Override
    public long getPos() throws IOException {
        return pos;
    }

    @Override
    public float getProgress() throws IOException {
        if (start == end) {
            return 0.0f;
        } else {
            return Math.min(1.0f, (pos - start) / (float) (end - start));
        }
    }

    @Override
    public boolean next(Text key, Text value) throws IOException {
        in.seekNextRecordBoundary();
        // StreamXmlRecordReader returns the matched XML snippet in its key argument.
        Text xml = new Text();
        Text unused = new Text();
        in.next(xml, unused);
        if (xml.toString() != null && xml.toString().length() > 0) {
            System.out.println(xml.toString());
            System.out.println(unused.toString());
            start += in.getPos();
            key.set(new LongWritable(++keyInt).toString());
            value.set(xml.toString());
            return true;
        } else {
            return false;
        }
    }
}
File Input format
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
public class XMLRecordFileInputFormat extends FileInputFormat<Text, Text> {

    XMLInterfaceRecordReader reader = null;

    public XMLRecordFileInputFormat() {
    }

    @Override
    public RecordReader<Text, Text> getRecordReader(InputSplit arg0, JobConf arg1, Reporter arg2)
            throws IOException {
        if (reader != null)
            return reader;
        else
            return new XMLInterfaceRecordReader(arg0, arg1, arg2);
    }
}
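The XMLRecord bean used in the main class is not included above; a minimal sketch of what it might look like, assuming plain String fields to match how setPos/setXml are called in readUsingFileInputFormat:
import java.io.Serializable;

// Minimal sketch of an XMLRecord bean (field names are illustrative);
// it only needs to be Serializable and carry the record position and the XML snippet.
public class XMLRecord implements Serializable {

    private static final long serialVersionUID = 1L;

    private String pos;
    private String xml;

    public String getPos() {
        return pos;
    }

    public void setPos(String pos) {
        this.pos = pos;
    }

    public String getXml() {
        return xml;
    }

    public void setXml(String xml) {
        this.xml = xml;
    }
}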
Upvotes: 1