Reputation: 1455
I have the following XML:
<TABLES>
<TABLE attrname="Red">
<ROWDATA>
<ROW Type="solid" track="0" Unit="0"/>
</ROWDATA>
</TABLE>
<TABLE attrname="Blue">
<ROWDATA>
<ROW Type="light" track="0" Unit="0"/>
<ROW Type="solid" track="0" Unit="0"/>
<ROW Type="solid" track="0" Unit="0"/>
</ROWDATA>
</TABLE>
I am using Spark and Scala. I want to read each field in the ROW tag and differentiate by the attribute names. Currently the code below just reads all the values inside the ROW tag but I want to read them based on the attribute names.
val df = session.read
.option("rowTag", "ROW")
.xml(filePath)
df.show(10)
df.printSchema()
Thanks in advance.
Upvotes: 1
Views: 181
Reputation: 10382
Check below code.
val spark = SparkSession.builder().master("local").appName("xml").getOrCreate()
import com.databricks.spark.xml._
import org.apache.spark.sql.functions._
import spark.implicits._
val xmlDF = spark.read
.option("rowTag", "TABLE")
.xml(xmlPath)
.select(explode_outer($"ROWDATA.ROW").as("row"),$"_attrname".as("attrname"))
.select(
$"row._Type".as("type"),
$"row._VALUE".as("value"),
$"row._Unit".as("unit"),
$"row._track".as("track"),
$"attrname"
)
xmlDF.printSchema()
xmlDF.show(false)
Schema
root
|-- type: string (nullable = true)
|-- value: string (nullable = true)
|-- unit: long (nullable = true)
|-- track: long (nullable = true)
|-- attrname: string (nullable = true)
Sample Data
+-----+-----+----+-----+--------+
|type |value|unit|track|attrname|
+-----+-----+----+-----+--------+
|solid|null |0 |0 |Red |
|light|null |0 |0 |Blue |
|solid|null |0 |0 |Blue |
|solid|null |0 |0 |Blue |
+-----+-----+----+-----+--------+
Upvotes: 2