Reputation: 135
I am reading an XML file using spark-xml in Python and ran into a seemingly very specific problem.
I was able to narrow down the part of the XML that is producing the problem, but not why it is happening.
Here is the code to read in the XML file:
src = sqlContext.read.format("com.databricks.spark.xml").options(rowTag="root").load("file.xml")
Here is the schema read in from the XML using the spark.xml library, with rowTag = "root":
root
|-- AID: long (nullable = true)
|-- RID: long (nullable = true)
|-- XmlData: struct (nullable = true)
| |-- NC: struct (nullable = true)
| | |-- RS: struct (nullable = true)
| | | |-- RD: struct (nullable = true)
| | | | |-- CR: struct (nullable = true)
| | | | | |-- Addr1: string (nullable = true)
| | | | | |-- Addr2: string (nullable = true)
| | | | | |-- City: string (nullable = true)
| | | | | |-- InFile: string (nullable = true)
| | | | | |-- Name: string (nullable = true)
| | | | | |-- Phone: long (nullable = true)
| | | | | |-- State: string (nullable = true)
| | | | | |-- Zip: long (nullable = true)
| | | | |-- SC: struct (nullable = true)
| | | | | |-- Class: string (nullable = true)
| | | | | |-- ClassType: string (nullable = true)
| | | | | |-- Message: string (nullable = true)
| | | | | |-- SC: long (nullable = true)
| | | | |-- NC: long (nullable = true)
| | | | |-- CRR: string (nullable = true)
| | | | |-- RM: struct (nullable = true)
| | | | | |-- Addr1: string (nullable = true)
| | | | | |-- City: string (nullable = true)
| | | | | |-- MemberNo: string (nullable = true)
| | | | | |-- Name: string (nullable = true)
| | | | | |-- State: string (nullable = true)
| | | | | |-- Zip: long (nullable = true)
| | | | |-- TL: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- _AvgDays: long (nullable = true)
| | | | | | |-- _Comment: string (nullable = true)
| | | | | | |-- _Current: long (nullable = true)
| | | | | | |-- _HC: long (nullable = true)
| | | | | | |-- _IndType: string (nullable = true)
| | | | | | |-- _LS: long (nullable = true)
| | | | | | |-- _OpenDate: long (nullable = true)
| | | | | | |-- _PD120Day: long (nullable = true)
| | | | | | |-- _PD30Day: long (nullable = true)
| | | | | | |-- _PD60Day: long (nullable = true)
| | | | | | |-- _PD90Day: long (nullable = true)
| | | | | | |-- _Region: string (nullable = true)
| | | | | | |-- _ReportDate: string (nullable = true)
| | | | | | |-- _SourceID: long (nullable = true)
| | | | | | |-- _TD: long (nullable = true)
| | | | | | |-- _VALUE: string (nullable = true)
| | | | |-- Trends: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- _CurrentPct: double (nullable = true)
| | | | | | |-- _LineCnt: long (nullable = true)
| | | | | | |-- _PD120DayPct: double (nullable = true)
| | | | | | |-- _PD30DayPct: double (nullable = true)
| | | | | | |-- _PD60DayPct: double (nullable = true)
| | | | | | |-- _PD90DayPct: double (nullable = true)
| | | | | | |-- _PD: string (nullable = true)
| | | | | | |-- _TD: long (nullable = true)
| | | | | | |-- _VALUE: string (nullable = true)
There is more to this schema, but for some reason the XML parser does not get past this point. I suspect there is an issue with the 'Trends' tag in the XML, but I cannot find it.
Here is a sample of a 'Trends' tag entry:
<Trends PD="4205" LineCnt="0" TD="0" CurrentPct="0" PD30DayPct="0" PD60DayPct="0" PD90DayPct="0" PD120DayPct="0" />
I understand that a self-closing tag such as this one could not be read by earlier versions of the XML parser, but it works fine in Databricks, and my other files with self-closing tags are read properly.
Here is the final result to further exemplify what I am explaining (I have 13 records in my XML):
+---------+--------+-------+
|AID |RID |XmlData|
+---------+--------+-------+
| null| null| null|
| null| null| null|
| null| null| null|
| null| null| null|
| null| null| null|
| null| null| null|
| null| null| null|
| null| null| null|
| null| null| null|
| null| null| null|
| null| null| null|
| null| null| null|
| null| null| null|
| null| null| null|
+---------+--------+-------+
(This top-level view is fine - my code later iterates through the structs/arrays in 'XmlData' - but it needs to be populated first, of course.)
Does anyone know why this would cause the schema to stop loading to completion when parsing?
Please note that I cannot explicitly define a schema - that would defeat the purpose of the project I am working on. I must be able to infer the schema dynamically, which, again, works fine for the other files I am using.
Upvotes: 1
Views: 2301
Reputation: 2082
Cause:
So I was able to figure out why this was happening. You are likely to see this issue when the XML you are trying to convert to a dataframe has inconsistent or empty values. For example, something like the following will have this issue:
<?xml version="1.0"?>
<Company>
<Employee id="1">
<Email>[email protected]</Email>
<Measures id="id32" type="AttributesInContext">
<Dimensions value="7in" title="Height"></Dimensions>
<Dimensions value="" title="Weight"></Dimensions></Measures>
</Employee>
<Employee id="2">
<Email>[email protected]</Email>
<Measures id="id33" type="AttributesInContext">
<Dimensions value="6in" title="Height"></Dimensions>
<Dimensions value="" title="Weight"></Dimensions></Measures>
</Employee>
<Employee id="3">
<Email>[email protected]</Email>
<Measures id="id34" type="AttributesInContext">
<Dimensions value="4in" title="Height"></Dimensions>
<Dimensions value="" title="Weight"></Dimensions></Measures>
</Employee>
</Company>
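To make the inconsistency concrete, here is a quick standard-library check (using a trimmed copy of the sample above, with the emails omitted). Every Weight entry carries value="", so a type inferrer scanning that attribute sees nothing but empty strings and has no basis to pick a type:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document above (one Employee is enough).
doc = """<?xml version="1.0"?>
<Company>
  <Employee id="1">
    <Measures id="id32" type="AttributesInContext">
      <Dimensions value="7in" title="Height"></Dimensions>
      <Dimensions value="" title="Weight"></Dimensions>
    </Measures>
  </Employee>
</Company>"""

root = ET.fromstring(doc)
# Collect the value attribute of every Weight dimension: all empty.
weights = [d.get("value") for d in root.iter("Dimensions")
           if d.get("title") == "Weight"]
print(weights)  # ['']
```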
Here, since you have value="" in each of the rowTag entries, you will likely get null in the dataframe for all rows because the parser is not able to infer a data type for that attribute. However, if you replace all value=""
fields with some actual value, you will see that this issue does not occur.
Solution:
As per the Databricks documentation, this issue can be resolved by using the option("nullValue", "")
option while reading the XML file. So your command will look like this (I tried this in Scala; it should be similar in Python):
val xmldf = sparkSession.read.format("com.databricks.spark.xml")
  .option("rowTag", rootTag)
  .option("nullValue", "")
  .load("/path/to/xml")
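For the Python side of the question, the equivalent call would look roughly like this. This is a sketch, not tested code: it assumes the spark-xml package is attached to the cluster and that sqlContext is available, as in the original question:

```python
# Sketch of the same fix in Python, mirroring the read call from the question.
# Assumes spark-xml is on the classpath and `sqlContext` exists (Databricks).
src = (
    sqlContext.read.format("com.databricks.spark.xml")
    .option("rowTag", "root")
    .option("nullValue", "")   # treat empty strings as null during inference
    .load("file.xml")
)
```

With nullValue set, empty attribute values no longer block type inference, and the rest of the schema should load past the 'Trends' tag.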
Upvotes: 3