Reputation: 545

Custom input reader in spark

I am new to Spark and would like to load page records from a Wikipedia dump into an RDD.

I tried using a record reader provided in hadoop streaming but can't figure out how to use it. Could anyone help me make the following code create a nice RDD with page records?

import org.apache.hadoop.io.Text
import org.apache.hadoop.streaming.StreamXmlRecordReader

import org.apache.hadoop.mapred.JobConf
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object WikiTest {

  def main(args: Array[String]) {

    // configuration
    val sparkConf = new SparkConf()
      .setMaster("local[4]")
      .setAppName("WikiDumpTest")

    val jobConf = new JobConf()
    jobConf.set("input", "enwikisource-20140906-pages-articles-multistream.xml")
    jobConf.set("stream.recordreader.class", "org.apache.hadoop.streaming.StreamXmlRecordReader")
    jobConf.set("stream.recordreader.begin", "<page>")
    jobConf.set("stream.recordreader.end", "</page>")
      
    val sparkContext = new SparkContext(sparkConf)
  
    // read data
    val wikiData = sparkContext.hadoopRDD(
        jobConf, 
        classOf[StreamXmlRecordReader],
        classOf[Text], 
        classOf[Text])

    // count rows
    println(wikiData.count)
  }
}

It seems Spark refuses to use StreamXmlRecordReader. I get the following error:

[error] found : Class[org.apache.hadoop.streaming.StreamXmlRecordReader (classOf[org.apache.hadoop.streaming.StreamXmlRecordReader])

[error] required: Class[_ <: org.apache.hadoop.mapreduce.InputFormat[?,?]]

[error] classOf[StreamXmlRecordReader]

If I ignore Eclispse's warning and launch the prgramm anyway I hit a java.lang.ClassNotFoundException.

Upvotes: 3

Answers (2)

djWann

Reputation: 2117

You're having the java.lang.ClassNotFoundException because you are trying to use an external dependency in spark (StreamXmlRecordReader). You have to create a fat jar, and deploy it in Spark.

This is a good example on how to create this type of jar: gradle fat jar tutorial

Also you can have a look here if you have problems parsing the XML file: parsing tutorial

Upvotes: 2

zsxwing

Reputation: 20826

You should use classOf[org.apache.hadoop.streaming.StreamInputFormat] instead of classOf[StreamXmlRecordReader].

java.lang.ClassNotFoundException is because you want to run your class WikiTest but it does not exist as it cannot be compiled.

Upvotes: 2

Custom input reader in spark

Answers (2)

Related Questions