Reputation: 109
I have an XML file in S3 that contains the schema for my table, called sample:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<DATA CHARSET="UTF8" DELIMITER="\t">
<COLUMNS>
<COLUMN DATA_PRECISION="10" DATA_SCALE="0" DATA_TYPE="NUMBER" ID="APPLICATION_ID" />
<COLUMN DATA_LENGTH="40" DATA_TYPE="VARCHAR2" ID="DESCRIPTIVE_FLEXFIELD_NAME"/>
<COLUMN DATA_LENGTH="30" DATA_TYPE="VARCHAR2" ID="LANGUAGE"/>
<COLUMN DATA_PRECISION="15" DATA_SCALE="0" DATA_TYPE="NUMBER" ID="CREATED_BY" />
<COLUMN DATA_PRECISION="10" DATA_SCALE="0" DATA_TYPE="NUMBER" ID="LAST_UPDATE_LOGIN" />
</COLUMNS>
</DATA>
I have already written a script: sample.filter( x => x.contains("DATA_TYPE") || x.contains("ID"))
, and I need to get each (ID, DATA_TYPE) pair of values, so the final output should look like
("APPLICATION_ID","NUMBER"), ("DESCRIPTIVE_FLEXFIELD_NAME","VARCHAR2"), etc.
Can anyone help me with this?
Thanks!
Upvotes: 0
Views: 99
Reputation: 10092
While this can be done in an easier and more robust way with spark-xml, you can also use the scala.xml API to parse the filtered lines:
import scala.xml._
val rdd = sc.textFile("file.xml").filter( x => x.contains("DATA_TYPE") || x.contains("ID"))
rdd.map(XML.loadString(_))                                           // parse each <COLUMN .../> line
  .map(node => ((node \\ "@ID").text, (node \\ "@DATA_TYPE").text))  // extract the ID and DATA_TYPE attribute values
  .collect.foreach(println)
// (APPLICATION_ID,NUMBER)
// (DESCRIPTIVE_FLEXFIELD_NAME,VARCHAR2)
// (LANGUAGE,VARCHAR2)
// (CREATED_BY,NUMBER)
// (LAST_UPDATE_LOGIN,NUMBER)
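Since the file in the question lives in S3, the same approach should work if you point textFile at the object itself. A minimal sketch, assuming the Hadoop S3A connector is available on the cluster and using a hypothetical bucket and key:
// hypothetical S3 path; replace with your actual bucket and key
val rdd = sc.textFile("s3a://your-bucket/path/sample.xml")
  .filter( x => x.contains("DATA_TYPE") || x.contains("ID"))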
Upvotes: 1
Reputation: 41957
You can use the spark-xml reader in Spark as
val df = sqlContext.read
.format("com.databricks.spark.xml")
.option("rootTag", "DATA")
.option("rowTag", "COLUMNS")
.load("path to the xml file")
and then apply some built-in functions, such as explode and select, as follows:
import org.apache.spark.sql.functions._
df.select(explode(col("COLUMN")))                               // one row per <COLUMN> element
  .select(col("col.*"))                                         // flatten the struct; XML attributes get a "_" prefix
  .filter(col("_DATA_TYPE").isNotNull || col("_ID").isNotNull)
  .select("_DATA_TYPE", "_ID")
which should give you
+----------+--------------------------+
|_DATA_TYPE|_ID |
+----------+--------------------------+
|NUMBER |APPLICATION_ID |
|VARCHAR2 |DESCRIPTIVE_FLEXFIELD_NAME|
|VARCHAR2 |LANGUAGE |
|NUMBER |CREATED_BY |
|NUMBER |LAST_UPDATE_LOGIN |
+----------+--------------------------+
Now you can convert it to an RDD and parse it as you want, or leave it as a DataFrame and apply other operations to it, for example as sketched below for the RDD option.
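A minimal sketch of the RDD option, reusing the df and imports defined above (both attributes are inferred as strings, so getString applies to both columns):
// select ID first, then DATA_TYPE, so each Row maps cleanly to the desired pair
val pairs = df.select(explode(col("COLUMN")))
  .select(col("col.*"))
  .filter(col("_DATA_TYPE").isNotNull || col("_ID").isNotNull)
  .select("_ID", "_DATA_TYPE")
  .rdd
  .map(row => (row.getString(0), row.getString(1)))

pairs.collect.foreach(println)
// (APPLICATION_ID,NUMBER)
// (DESCRIPTIVE_FLEXFIELD_NAME,VARCHAR2)
// etc.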
Upvotes: 1