Reputation: 545
I am new to Spark and would like to load page records from a Wikipedia dump into an RDD.
I tried using a record reader provided in hadoop streaming but can't figure out how to use it. Could anyone help me make the following code create a nice RDD with page records?
import org.apache.hadoop.io.Text
import org.apache.hadoop.streaming.StreamXmlRecordReader
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
object WikiTest {
def main(args: Array[String]) {
// configuration
val sparkConf = new SparkConf()
.setMaster("local[4]")
.setAppName("WikiDumpTest")
val jobConf = new JobConf()
jobConf.set("input", "enwikisource-20140906-pages-articles-multistream.xml")
jobConf.set("stream.recordreader.class", "org.apache.hadoop.streaming.StreamXmlRecordReader")
jobConf.set("stream.recordreader.begin", "<page>")
jobConf.set("stream.recordreader.end", "</page>")
val sparkContext = new SparkContext(sparkConf)
// read data
val wikiData = sparkContext.hadoopRDD(
jobConf,
classOf[StreamXmlRecordReader],
classOf[Text],
classOf[Text])
// count rows
println(wikiData.count)
}
}
It seems Spark refuses to use StreamXmlRecordReader. I get the following error:
[error] found : Class[org.apache.hadoop.streaming.StreamXmlRecordReader (classOf[org.apache.hadoop.streaming.StreamXmlRecordReader])
[error] required: Class[_ <: org.apache.hadoop.mapreduce.InputFormat[?,?]]
[error] classOf[StreamXmlRecordReader]
If I ignore Eclispse's warning and launch the prgramm anyway I hit a java.lang.ClassNotFoundException.
Upvotes: 3
Views: 3515
Reputation: 2117
You're having the java.lang.ClassNotFoundException
because you are trying to use an external dependency in spark (StreamXmlRecordReader
). You have to create a fat jar, and deploy it in Spark.
This is a good example on how to create this type of jar: gradle fat jar tutorial
Also you can have a look here if you have problems parsing the XML file: parsing tutorial
Upvotes: 2
Reputation: 20826
You should use classOf[org.apache.hadoop.streaming.StreamInputFormat]
instead of classOf[StreamXmlRecordReader]
.
java.lang.ClassNotFoundException
is because you want to run your class WikiTest
but it does not exist as it cannot be compiled.
Upvotes: 2