Sindu_
Sindu_

Reputation: 1455

Read XML with attribute names in Scala

I have the following XML:

<TABLES>
    <TABLE attrname="Red">
        <ROWDATA>
            <ROW Type="solid" track="0" Unit="0"/>
        </ROWDATA>
    </TABLE>
    <TABLE attrname="Blue">
        <ROWDATA>
            <ROW Type="light" track="0" Unit="0"/>
            <ROW Type="solid" track="0" Unit="0"/>
            <ROW Type="solid" track="0" Unit="0"/>
        </ROWDATA>
    </TABLE>

I am using Spark and Scala. I want to read each field in the ROW tag and differentiate by the attribute names. Currently the code below just reads all the values inside the ROW tag but I want to read them based on the attribute names.

val df = session.read
  .option("rowTag", "ROW")
  .xml(filePath)

df.show(10)
df.printSchema()

Thanks in advance.

Upvotes: 1

Views: 181

Answers (1)

s.polam
s.polam

Reputation: 10382

Check below code.

 val spark = SparkSession.builder().master("local").appName("xml").getOrCreate()

  import com.databricks.spark.xml._
  import org.apache.spark.sql.functions._
  import spark.implicits._

   val xmlDF = spark.read
     .option("rowTag", "TABLE")
     .xml(xmlPath)
     .select(explode_outer($"ROWDATA.ROW").as("row"),$"_attrname".as("attrname"))
     .select(
       $"row._Type".as("type"),
       $"row._VALUE".as("value"),
       $"row._Unit".as("unit"),
       $"row._track".as("track"),
       $"attrname"
     )

  xmlDF.printSchema()
  xmlDF.show(false)

Schema

root
 |-- type: string (nullable = true)
 |-- value: string (nullable = true)
 |-- unit: long (nullable = true)
 |-- track: long (nullable = true)
 |-- attrname: string (nullable = true)

Sample Data

+-----+-----+----+-----+--------+
|type |value|unit|track|attrname|
+-----+-----+----+-----+--------+
|solid|null |0   |0    |Red     |
|light|null |0   |0    |Blue    |
|solid|null |0   |0    |Blue    |
|solid|null |0   |0    |Blue    |
+-----+-----+----+-----+--------+

Upvotes: 2

Related Questions