Reputation: 199
I have an XML file that looks as follows:
<report>
<instance name="XYZ" number ="1">
<key value="ABC"/>
<key value="ABCD"/>
</instance>
<instance name="PQR" number ="2">
<key value="ABCDE"/>
<key value="ABCDF"/>
</instance>
</report>
I want to create a SparkSQL table in the following fashion:
name number value
XYZ 1 ABC
XYZ 1 ABCD
PQR 2 ABCDE
PQR 2 ABCDF
Is this technically possible?
I am using PySpark, and this is what I have done so far:
df = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "report").load("//path//to//file.xml")
df.registerTempTable("XMLtable")
I got the schema as:
root
|-- instance: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- @name: string (nullable = true)
| | |-- @number: long (nullable = true)
| | |-- key: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- @value: string (nullable = true)
I performed the following queries:
sqlContext.sql("select * from XMLtable").show()
+-------------+
| INSTANCE|
+-------------+
|[[null,XYZ...|
+-------------+
This query correctly points me to the right data.
sqlContext.sql("select instance[1].key[1]['@value'] as value from XMLtable").show()
+--------+
| value |
+--------+
|(1)ABCDF|
+--------+
So as far as I understand, the table is not loaded as expected. How can I load the table with attributes, as opposed to the general way of loading values?
Can anybody point me in the right direction? I have already checked without attribute values, and that works pretty well. What I need is to create a table as described above with the attribute values populated, which will in turn help in writing queries against the parsed XML data. I also plan to convert this into a pandas DataFrame later to do more statistical analysis of the data.
Upvotes: 1
Views: 4776
Reputation: 13926
Yes, it's possible. First, specify instance instead of report for the rowTag option when loading the XML, and specify a custom value for valueTag (to avoid a conflict with the value attribute), for example:
df = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "instance") \
.option("valueTag", "some_value").load("data.xml")
Then you can easily get the dataframe into the requested format using:
>>> from pyspark.sql.functions import col, explode
>>> df.withColumn('values', explode('key')).select(
...     col('_name').alias('name'),
...     col('_number').alias('number'),
...     col('values._value').alias('value')
... ).show()
+----+------+-----+
|name|number|value|
+----+------+-----+
| XYZ| 1| ABC|
| XYZ| 1| ABCD|
| PQR| 2|ABCDE|
| PQR| 2|ABCDF|
+----+------+-----+
Finally, register this dataframe as a temp table and it will behave as you expected.
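As a quick cross-check outside Spark (and since you mention moving to pandas for analysis later), the same flattening can be sketched with the standard library's xml.etree.ElementTree on your sample data. This is only an illustration of the target shape, not part of the Spark pipeline:

```python
import xml.etree.ElementTree as ET

# Your sample document, inlined for the sake of a self-contained example.
xml_doc = """<report>
  <instance name="XYZ" number="1">
    <key value="ABC"/>
    <key value="ABCD"/>
  </instance>
  <instance name="PQR" number="2">
    <key value="ABCDE"/>
    <key value="ABCDF"/>
  </instance>
</report>"""

root = ET.fromstring(xml_doc)

# One output row per (instance, key) pair: attributes are read with .get().
rows = [
    (inst.get("name"), int(inst.get("number")), key.get("value"))
    for inst in root.findall("instance")
    for key in inst.findall("key")
]

for name, number, value in rows:
    print(name, number, value)
# XYZ 1 ABC
# XYZ 1 ABCD
# PQR 2 ABCDE
# PQR 2 ABCDF
```

The nested list comprehension mirrors what explode('key') does in Spark: it turns each element of the key array into its own row while repeating the parent instance attributes.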
Upvotes: 1