pad11

Reputation: 311

Extracting elements from XML records using Spark / Scala

I'm trying to extract elements from XML records, where each XML file contains many records. Below are the (modified) code and the sample XML files I'm using.

I expect an array of Strings in which each element has the form "user:id", but the result is ":". I expected XML.loadString to parse each file into separate XML records, so the two sample files would yield four records. Instead I get two.

When I add println(d) right after the call to details.next, it prints the entire contents of the file, which is likely why getId and getUser return nothing.

Am I handling the load incorrectly?

import org.apache.spark.{SparkConf, SparkContext}
import scala.xml._
import scala.collection.mutable.ArrayBuffer

object Details {

    def getDetails(xmlstring: String): Iterator[Node] = {
        val nodes = XML.loadString(xmlstring)
        nodes.toIterator
    }

    def getId(detail: Node): String = {
        (detail \ "id").text
    }

    def getUser(detail: Node): String = {
        (detail \ "user").text
    }

    def getDetailList(details: Iterator[Node]): Array[String] = {
        var list = ArrayBuffer[String]()
        while (details.hasNext) {
            val d = details.next
            val user = getUser(d)
            val id = getId(d)
            val formattedText = user + ":" + id
            list += formattedText
        }
        list.toArray
    }

    def main(args: Array[String]): Unit = {

        val conf = new SparkConf().setAppName("Details")
        val sc: SparkContext = new SparkContext(conf)

        val lines = sc.wholeTextFiles("file:///path/to/files/")
        val xmlStrings = lines.map(line => line._2)
        val detailsRecords = xmlStrings.map(getDetails)
        val detailsList = detailsRecords.map(getDetailList)

        sc.stop()
    }
}

And two sample files...

test.xml

<details>
  <detail>
    <user>Dan</user>
    <id>5555</id>
  </detail>
  <detail>
    <user>Mike</user>
    <id>6666</id>
  </detail>
</details>

test2.xml

<details>
  <detail>
    <user>John</user>
    <id>1234</id>
  </detail>
  <detail>
    <user>Joe</user>
    <id>5678</id>
  </detail>
</details>

Upvotes: 2

Views: 3005

Answers (2)

tpysz5n

Reputation: 11

I'm four months late, but I think I have the answer for you.

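The problem lies in the getDetails() function. XML.loadString returns the root element, <details>, and iterating over it yields only that single element, so (detail \ "id") and (detail \ "user") search the direct children of <details> and find nothing. You have to select the <detail> nodes explicitly. Modify your code as below: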

  def getDetails(xmlstring: String): Iterator[Node] = {
    val nodes = XML.loadString(xmlstring) \\ "detail"
    nodes.toIterator
  }

Appending \\ "detail" at the end of XML.loadString() is all you need to get the code working as you expect.
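To see the difference without a Spark cluster, here is a minimal sketch that uses only scala.xml. The object name DetailDemo and the helpers rootOnly, records and format are made up for illustration and do not come from the question. It assumes the scala-xml module is on the classpath.

```scala
import scala.xml.{Node, XML}

object DetailDemo {
  // One of the sample files from the question, inlined as a string.
  val sample: String =
    """<details>
      |  <detail><user>Dan</user><id>5555</id></detail>
      |  <detail><user>Mike</user><id>6666</id></detail>
      |</details>""".stripMargin

  // Original behaviour: a parsed element, viewed as a sequence,
  // contains only itself (the <details> root).
  def rootOnly(xml: String): Seq[Node] = XML.loadString(xml).toList

  // Fixed behaviour: \\ "detail" selects every <detail> element in the tree.
  def records(xml: String): Seq[Node] = (XML.loadString(xml) \\ "detail").toList

  // Same formatting as in the question: user + ":" + id.
  // \ only looks at direct children of the given node.
  def format(d: Node): String = (d \ "user").text + ":" + (d \ "id").text

  def main(args: Array[String]): Unit = {
    println(rootOnly(sample).map(format)) // List(:)
    println(records(sample).map(format))  // List(Dan:5555, Mike:6666)
  }
}
```

In the Spark job itself, note that map over the files still gives one array per file. If you want a single flat collection of "user:id" strings across all files, using flatMap in place of the last map is probably what you are after.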

Cheers,

Upvotes: 1

illak zapata

Reputation: 66

You could use spark-xml, the Databricks XML data source for Spark.

With this library you can read all your xml files like this:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

val df = sqlContext.read
   .format("com.databricks.spark.xml")
   .option("rowTag", "detail")
   .load("/home/path-with-xml-files")

This generates a DataFrame with the columns id and user and the following contents:

+----+----+
|  id|user|
+----+----+
|5555| Dan|
|6666|Mike|
|1234|John|
|5678| Joe|
+----+----+

Then get an array from this DF:

val id_users_array = df.collect

This array has the type:

id_users_array: Array[org.apache.spark.sql.Row] = Array([5555,Dan], [6666,Mike], [1234,John], [5678,Joe])

If you want to print only the ids:

id_users_array.map(r => r.get(0)).foreach(println)

outputs:

5555
6666
1234
5678
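If you want the "user:id" strings from the question instead of only the ids, a small helper along these lines might do. This is a sketch: RowFormat and userColonId are hypothetical names, and the helper assumes each Row holds user at position 0 and id at position 1, which you would need to arrange with a select first.

```scala
import org.apache.spark.sql.Row

object RowFormat {
  // Assumption: the Row was produced by selecting the columns
  // in the order (user, id), so position 0 is user and 1 is id.
  def userColonId(r: Row): String = s"${r.get(0)}:${r.get(1)}"
}
```

With the df from above, usage would look like df.select("user", "id").collect.map(RowFormat.userColonId), which yields an Array[String].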

Hope this helps.

Upvotes: 1
