Extracting row tag schema from StructType in Scala to parse nested XML

Question

I'm trying to parse a wide, nested XML file into a DataFrame using the spark-xml library.

Here is an abbreviated schema definition (XSD):

...

The XML file containing the data would look something like this:

It's clear that the rowTag needs to be Item, but I've run into an issue with the XSD: the row schema is nested inside the document schema.

import com.databricks.spark.xml.util.XSDToSchema
import com.databricks.spark.xml._
import java.nio.file.Paths
import org.apache.spark.sql.functions._

val inputFile = "dbfs:/samples/ItemExport.xml"
val schema = XSDToSchema.read(Paths.get("/dbfs/samples/ItemExport.xsd"))
val df1 = spark.read.option("rowTag", "Item").xml(inputFile)
val df2 = spark.read.schema(schema).xml(inputFile)

I basically want to get the struct under Item under the root element, not the entire document schema.

schema.printTreeString

root
|-- ItemExport: struct (nullable = false)
|    |-- Item: struct (nullable = false)
|    |    |-- ITEM_ID: integer (nullable = false)
|    |    |-- CONTEXT: string (nullable = false)
|    |    |-- TYPE: string (nullable = false)
...(a few more fields...)
|    |    |-- CLASSIFICATIONS: struct (nullable = false)
|    |    |    |-- CLASSIFICATION: array (nullable = false)
|    |    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |    |-- CLASS_SCHEME: string (nullable = false)
|    |    |    |    |    |-- CLASS_LEVEL: string (nullable = false)
|    |    |    |    |    |-- CLASS_CODE: string (nullable = false)
|    |    |    |    |    |-- CLASS_CODE_NAME: string (nullable = false)
|    |    |    |    |    |-- EFFECTIVE_FROM: timestamp (nullable = false)
|    |    |    |    |    |-- EFFECTIVE_TO: timestamp (nullable = false)

In the case above, parsing with the document schema yields an empty DataFrame:

df2.show()

+-----------+
| ItemExport|
+-----------+
+-----------+

The inferred schema, on the other hand, is basically correct, but nested columns can only be inferred when they are present in the data (which is not always the case):

df1.show()

+----------+--------------------+----------+---------------+
|   ITEM_ID|             CONTEXT|      TYPE|CLASSIFICATIONS|
+----------+--------------------+----------+---------------+
|        56|            Sample  |   Product|         {null}|
|        57|            Sample  |   Product|         {null}|
|        59|              Part  | Component|         {null}|
|        60|              Part  | Component|         {null}|
|        61|            Sample  |   Product|         {null}|
|        62|            Sample  |   Product|         {null}|
|        63|          Assembly  |   Product|         {null}|

df1.printSchema

root
|-- ITEM_ID: long (nullable = true)
|-- CONTEXT: string (nullable = false)
|-- TYPE: string (nullable = true)
...
|-- CLASSIFICATIONS: struct (nullable = true)
|    |-- CLASSIFICATION: array (nullable = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- CLASS_CODE: long (nullable = true)
|    |    |    |-- CLASS_CODE_NAME: string (nullable = true)
|    |    |    |-- CLASS_LEVEL: long (nullable = true)
|    |    |    |-- CLASS_SCHEME: string (nullable = true)
|    |    |    |-- EFFECTIVE_FROM: string (nullable = true)
|    |    |    |-- EFFECTIVE_TO: string (nullable = true)

As described here and in the XML library docs ("Path to an XSD file that is used to validate the XML for each row individually"), I can parse into a given row-level schema as such:

import org.apache.spark.sql.types._

val structschema = StructType(
  Array(
    StructField("ITEM_ID", IntegerType, false),
    StructField("CONTEXT", StringType, false),
    StructField("TYPE", StringType, false)
  )
)

val df_struct = spark.read.schema(structschema).option("rowTag", "Item").xml(inputFile)

However, I'd like to obtain the schema for the nested columns from the XSD. How can I extract the row-level schema from the document schema above?

Version info: Scala 2.12, Spark 3.1.1, spark-xml 0.12.0

s.polam · Accepted Answer

The XSD declares every column as required (nullable = false), but some of those columns are null in the XML file. To make the schema match the file content, change every field from nullable = false to nullable = true.

Try the following code.

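  import com.databricks.spark.xml.util.XSDToSchema
  import com.databricks.spark.xml._
  import java.nio.file.Paths
  import org.apache.spark.sql.Row       // needed for emptyRDD[Row]
  import org.apache.spark.sql.types._   // needed for StructType, StructField, ArrayType
  import org.apache.spark.sql.functions._

  val inputFile = "dbfs:/samples/ItemExport.xml"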

Read the schema from the XSD, apply it to an empty DataFrame, and select ItemExport.Item.* to keep only the row-level columns.

 val schema = spark
    .createDataFrame(
      spark
        .sparkContext
        .emptyRDD[Row],
      XSDToSchema
        .read(Paths.get("/dbfs/samples/ItemExport.xsd"))
    )
    .select("ItemExport.Item.*")
    .schema
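
The same row-level struct can also be reached by walking the StructType directly, without building an empty DataFrame. The following is a minimal sketch, not part of the original answer: structAt is a hypothetical helper name, and documentSchema is a hand-written stand-in for the value returned by XSDToSchema.read, reduced to three fields.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: follow a chain of field names through nested structs.
// An array of structs is stepped into via its element type.
def structAt(schema: StructType, path: String*): StructType =
  path.foldLeft(schema) { (current, name) =>
    // StructType.apply(name) throws IllegalArgumentException if the field is missing.
    current(name).dataType match {
      case s: StructType               => s
      case ArrayType(s: StructType, _) => s
      case other =>
        throw new IllegalArgumentException(
          s"Field '$name' is ${other.simpleString}, not a struct")
    }
  }

// Assumption: stand-in for XSDToSchema.read(...), shortened for illustration.
val documentSchema = StructType(Seq(
  StructField("ItemExport", StructType(Seq(
    StructField("Item", StructType(Seq(
      StructField("ITEM_ID", IntegerType, nullable = false),
      StructField("CONTEXT", StringType, nullable = false),
      StructField("TYPE", StringType, nullable = false)
    )), nullable = false)
  )), nullable = false)
))

// Row-level schema: the struct under ItemExport.Item.
val rowSchema = structAt(documentSchema, "ItemExport", "Item")
rowSchema.printTreeString()
```

With the real XSD-derived schema, the call would presumably be structAt(XSDToSchema.read(...), "ItemExport", "Item"); the nullable flags still need relaxing afterwards, as described in this answer.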


  val df2 = spark.read
    .option("rootTag", "ItemExport")
    .option("rowTag", "Item")
    .schema(setNullable(schema, true)) // make every column optional (nullable = true) so the XSD schema matches the XML content
    .xml(inputFile)

The function below sets the nullable flag on every column, including nested ones; calling it with true makes all columns optional. Define it before building df2.

  // Recursively rebuilds the schema with every field's nullable flag set to `nullable`.
  def setNullable(schema: StructType, nullable: Boolean = false): StructType = {
    def recurNullable(schema: StructType): Seq[StructField] =
      schema.fields.map {
        // Nested struct: recurse into its fields.
        case StructField(name, dtype: StructType, _, meta) =>
          StructField(name, StructType(recurNullable(dtype)), nullable, meta)
        case StructField(name, dtype: ArrayType, _, meta) => dtype.elementType match {
          // Array of structs: recurse into the element struct.
          case struct: StructType =>
            StructField(name, ArrayType(StructType(recurNullable(struct)), true), nullable, meta)
          // Array of anything else: keep the array type itself, not just its element type.
          case _ => StructField(name, dtype, nullable, meta)
        }
        // Leaf field: only the flag changes.
        case StructField(name, dtype, _, meta) =>
          StructField(name, dtype, nullable, meta)
      }

    StructType(recurNullable(schema))
  }
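
To confirm that a relaxed schema really has no required fields left at any depth, a small recursive check can help. This is a sketch under assumptions: allNullable is a hypothetical name, and the two schemas below are hand-written stand-ins for the XSD-derived schema before and after relaxing.

```scala
import org.apache.spark.sql.types._

// Hypothetical check: true only if every StructField reachable from `dt`
// (through nested structs, array elements and map values) is nullable.
def allNullable(dt: DataType): Boolean = dt match {
  case s: StructType          => s.fields.forall(f => f.nullable && allNullable(f.dataType))
  case ArrayType(element, _)  => allNullable(element)
  case MapType(_, value, _)   => allNullable(value)
  case _                      => true
}

// Stand-in for the XSD-derived row schema: everything required.
val strict = StructType(Seq(
  StructField("ITEM_ID", IntegerType, nullable = false),
  StructField("CLASSIFICATIONS", StructType(Seq(
    StructField("CLASSIFICATION", ArrayType(StructType(Seq(
      StructField("CLASS_CODE", StringType, nullable = false)
    )), containsNull = true), nullable = false)
  )), nullable = false)
))

// The same shape with every field optional.
val relaxed = StructType(Seq(
  StructField("ITEM_ID", IntegerType, nullable = true),
  StructField("CLASSIFICATIONS", StructType(Seq(
    StructField("CLASSIFICATION", ArrayType(StructType(Seq(
      StructField("CLASS_CODE", StringType, nullable = true)
    )), containsNull = true), nullable = true)
  )), nullable = true)
))

println(allNullable(strict))   // false
println(allNullable(relaxed))  // true
```

With setNullable in scope, allNullable(setNullable(schema, true)) would be expected to return true for the schema used to read the file.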
