Sidi

Reputation: 109

Scala String parser

I have an XML file in S3 that contains the schema for my table, called sample:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<DATA  CHARSET="UTF8" DELIMITER="\t">
<COLUMNS>
<COLUMN DATA_PRECISION="10" DATA_SCALE="0" DATA_TYPE="NUMBER"  ID="APPLICATION_ID" />
<COLUMN DATA_LENGTH="40" DATA_TYPE="VARCHAR2" ID="DESCRIPTIVE_FLEXFIELD_NAME"/>
<COLUMN DATA_LENGTH="30" DATA_TYPE="VARCHAR2" ID="LANGUAGE"/>
<COLUMN DATA_PRECISION="15" DATA_SCALE="0" DATA_TYPE="NUMBER" ID="CREATED_BY" />
<COLUMN DATA_PRECISION="10" DATA_SCALE="0" DATA_TYPE="NUMBER" ID="LAST_UPDATE_LOGIN" />

</COLUMNS>
</DATA>

I have already written a script as sample.filter( x => x.contains("DATA_TYPE") || x.contains("ID")), and I now need to extract each (ID, DATA_TYPE) pair of values, so the final output should look like

("APPLICATION_ID","NUMBER"),("DESCRIPTIVE_FLEXFIELD_NAME","VARCHAR2"),etc.

Can anyone help me with this?

Thanks!!!

Upvotes: 0

Views: 99

Answers (2)

philantrovert

Reputation: 10092

While this can be done more easily and robustly with spark-xml, you can also use the scala.xml API to parse it:

import scala.xml._

// keep only the lines that contain the COLUMN definitions
val rdd = sc.textFile("file.xml")
  .filter(x => x.contains("DATA_TYPE") || x.contains("ID"))

// parse each line as an XML element and pull out the ID and DATA_TYPE attributes
rdd.map(XML.loadString(_))
   .map(node => ((node \\ "@ID").text, (node \\ "@DATA_TYPE").text))
   .collect.foreach(println)

// (APPLICATION_ID,NUMBER)
// (DESCRIPTIVE_FLEXFIELD_NAME,VARCHAR2)
// (LANGUAGE,VARCHAR2)
// (CREATED_BY,NUMBER)
// (LAST_UPDATE_LOGIN,NUMBER)
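
Note that the line-by-line filter assumes each COLUMN element sits on a single line. If that isn't guaranteed, a small variation (a sketch only, with "file.xml" standing in for your actual S3 path) parses the whole document once and extracts the same pairs:

import scala.xml.XML

// read the whole file on the driver and parse it as one XML document
val doc = XML.loadString(sc.wholeTextFiles("file.xml").values.first)

// collect (ID, DATA_TYPE) for every COLUMN element, regardless of line breaks
val pairs = (doc \\ "COLUMN").map { node =>
  ((node \ "@ID").text, (node \ "@DATA_TYPE").text)
}

pairs.foreach(println)

This keeps everything in one pass and does not depend on how the XML happens to be formatted.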

Upvotes: 1

Ramesh Maharjan

Reputation: 41957

You can use the XML reader in Spark as follows:

val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rootTag", "DATA")
  .option("rowTag", "COLUMNS")
  .load("path to the xml file")

Then apply some built-in functions such as explode and select:

import org.apache.spark.sql.functions._

// explode the COLUMN array into one row per column definition;
// spark-xml exposes XML attributes with a leading underscore by default
df.select(explode(col("COLUMN")))
  .select(col("col.*"))
  .filter(col("_DATA_TYPE").isNotNull || col("_ID").isNotNull)
  .select("_DATA_TYPE", "_ID")

which should give you

+----------+--------------------------+
|_DATA_TYPE|_ID                       |
+----------+--------------------------+
|NUMBER    |APPLICATION_ID            |
|VARCHAR2  |DESCRIPTIVE_FLEXFIELD_NAME|
|VARCHAR2  |LANGUAGE                  |
|NUMBER    |CREATED_BY                |
|NUMBER    |LAST_UPDATE_LOGIN         |
+----------+--------------------------+

Now you can convert it to an RDD and parse it as you want, or leave it as a DataFrame and apply other operations on it.
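
For instance, a quick sketch of that conversion, assuming the transformed DataFrame above is assigned to a value named result:

// `result` is the DataFrame produced by the select/filter chain above (assumed name)
val pairs = result
  .rdd
  .map(row => (row.getAs[String]("_ID"), row.getAs[String]("_DATA_TYPE")))

pairs.collect.foreach(println)
// e.g. (APPLICATION_ID,NUMBER), (DESCRIPTIVE_FLEXFIELD_NAME,VARCHAR2), ...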

Upvotes: 1
