Bala

Reputation: 11264

How do I process an XML file in Spark?

I am learning Spark and Scala. I have a snippet that processes an XML literal, but when I try to load the XML from a file, I can't make it work. I am probably missing a key piece of understanding and would appreciate some help. I am using the Cloudera VM, which has Spark 1.6 and Scala 2.10.5.

Scenario: read the XML, extract id and name, and display each employee as id@name.

scala> import scala.xml._
scala> val strxml = <employees>
     | <employee><id>1</id><name>chris</name></employee>
     | <employee><id>2</id><name>adam</name></employee>
     | <employee><id>3</id><name>karl</name></employee>
     | </employees>
strxml: scala.xml.Elem = 
<employees>
<employee><id>1</id><name>chris</name></employee>
<employee><id>2</id><name>adam</name></employee>
<employee><id>3</id><name>karl</name></employee>
</employees>

scala> val t = strxml.flatMap(line => line \\ "employee")
t: scala.xml.NodeSeq = NodeSeq(<employee><id>1</id><name>chris</name></employee>, <employee><id>2</id><name>adam</name></employee>, <employee><id>3</id><name>karl</name></employee>)

scala> t.map(l => (l \\ "id").text + "@" + (l \\ "name").text).foreach(println)
1@chris
2@adam
3@karl

Loading it from a file (an exception is thrown; what am I doing wrong here?)

scala> val filexml = sc.wholeTextFiles("file:///home/cloudera/test*")
filexml: org.apache.spark.rdd.RDD[(String, String)] = file:///home/cloudera/test* MapPartitionsRDD[66] at wholeTextFiles at <console>:30

scala> val lines = filexml.map(line => XML.loadString(line._2))
lines: org.apache.spark.rdd.RDD[scala.xml.Elem] = MapPartitionsRDD[89] at map at <console>:32

scala> val ft = lines.map(l => l \\ "employee")
ft: org.apache.spark.rdd.RDD[scala.xml.NodeSeq] = MapPartitionsRDD[99] at map at <console>:34

scala> ft.map(l => (l \\ "id").text + "@" + (l \\ "name").text).foreach(println)

Exception in task 0.0 in stage 63.0 (TID 63)
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog

Contents of files

test.xml

<employees>
<employee><id>1</id><name>chris</name></employee>
<employee><id>2</id><name>adam</name></employee>
<employee><id>3</id><name>karl</name></employee>
</employees>

test2.xml

<employees>
<employee><id>4</id><name>hive</name></employee>
<employee><id>5</id><name>elixir</name></employee>
<employee><id>6</id><name>spark</name></employee>
</employees>

Upvotes: 2

Views: 4553

Answers (2)

Bala

Reputation: 11264

Answering my own question.

scala> val filexml = sc.wholeTextFiles("file:///Volumes/BigData/sample_data/test*.xml")
filexml: org.apache.spark.rdd.RDD[(String, String)] = file:///Volumes/BigData/sample_data/test*.xml MapPartitionsRDD[1] at wholeTextFiles at <console>:24

scala> val lines = filexml.flatMap(line => XML.loadString(line._2) \\ "employee")
lines: org.apache.spark.rdd.RDD[scala.xml.Node] = MapPartitionsRDD[3] at flatMap at <console>:29

scala> lines.map(line => (line \\ "id").text + "@" + (line \\ "name").text).foreach(println)
1@chris
2@adam
3@karl
4@hive
5@elixir
6@spark
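
Two things are worth spelling out here, for anyone comparing this with the snippet in the question. First, the glob now ends in .xml, so stray non-XML files under the same prefix are no longer matched; a non-XML file being parsed is one plausible source of the "Content is not allowed in prolog" error. Second, flatMap makes each <employee> node its own RDD element, whereas the question's map produced one NodeSeq per file, and projecting on a whole NodeSeq concatenates every match. A minimal standalone illustration, using the same XML literal as the question:

import scala.xml._

val strxml = <employees>
  <employee><id>1</id><name>chris</name></employee>
  <employee><id>2</id><name>adam</name></employee>
  <employee><id>3</id><name>karl</name></employee>
</employees>

val all: NodeSeq = strxml \\ "employee"

// Projecting on the whole NodeSeq concatenates all matches:
(all \\ "id").text                 // "123"

// Mapping over the sequence projects per employee node:
all.map(e => (e \\ "id").text)     // List("1", "2", "3")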

Upvotes: 2

ROOT

Reputation: 1775

This is Java code for processing XML data in Spark; adapt it to your requirements.

package packagename;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;

import com.databricks.spark.xml.XmlReader;

public class XmlreaderSpark {
    public static void main(String[] args) {
        String localxml = "file path";   // path to your XML file
        String booksFileTag = "user";    // element that marks one row

        String warehouseLocation = "file:" + System.getProperty("user.dir") + "/spark-warehouse";
        System.out.println("warehouseLocation: " + warehouseLocation);

        SparkSession spark = SparkSession
                .builder()
                .master("local")
                .appName("Java Spark SQL Example")
                .config("spark.sql.warehouse.dir", warehouseLocation)
                .config("spark.sql.crossJoin.enabled", "true")
                .enableHiveSupport()
                .getOrCreate();
        SQLContext sqlContext = new SQLContext(spark);

        // Each element matching the row tag becomes one row of the DataFrame.
        Dataset<Row> df = new XmlReader().withRowTag(booksFileTag).xmlFile(sqlContext, localxml);
        df.show();
    }
}

You need to add the following dependency:

<dependency>
   <groupId>com.databricks</groupId>
   <artifactId>spark-xml_2.10</artifactId>
   <version>0.4.0</version>
</dependency>
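
On newer spark-xml builds the same read can also go through the standard DataFrame reader; here is a minimal Scala sketch, assuming spark-xml is on the classpath and an existing SparkContext sc (the file path is hypothetical):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

val df = sqlContext.read
  .format("com.databricks.spark.xml")   // spark-xml data source
  .option("rowTag", "employee")         // each <employee> becomes one row
  .load("file:///home/cloudera/test.xml")

df.select("id", "name").show()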

Upvotes: 0
