N2M

Reputation: 199

spark-xml: How to create a SparkSQL table from the attributes of an XML file

I have an XML file that looks like this:

<report>
     <instance name="XYZ" number ="1">
          <key value="ABC"/>
          <key value="ABCD"/>
     </instance>
     <instance name="PQR" number ="2">
          <key value="ABCDE"/>
          <key value="ABCDF"/>
     </instance>
</report>

I want to create a SparkSQL table in the following fashion:

name  number  value
XYZ    1      ABC
XYZ    1      ABCD
PQR    2      ABCDE
PQR    2      ABCDF

Is this technically possible?

I am using pySpark, and this is what I have done so far:

df = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "report").load("//path//to//file.xml")
df.registerTempTable("XMLtable")

I got the schema as:

root
 |-- instance: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- @name: string (nullable = true)
 |    |    |-- @number: long (nullable = true)
 |    |    |-- key: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- @value: string (nullable = true)

I performed the following queries:

sqlContext.sql("select * from XMLtable").show()

+-------------+
|     INSTANCE|
+-------------+
|[[null,XYZ...|
+-------------+

This query correctly points me to the right data.

sqlContext.sql("select instance[1].key[1]['@value'] as value from XMLtable").show()

+--------+
|  value |
+--------+
|(1)ABCDF|
+--------+

So, as far as I understand, the table is not loaded as expected. How can I load the table with the attributes, as opposed to the usual way of loading element values?

Can anybody point me in the right direction? I have already checked without attribute values, and that works well. What I need is to create a table as described above, with the attribute values populated, which will in turn help in writing queries against the parsed XML data. I also plan to convert this into a pandas dataframe later to do more statistical analysis of the data.

Upvotes: 1

Views: 4776

Answers (1)

Mariusz

Reputation: 13926

Yes, it's possible. First, specify instance instead of report as the rowTag option when loading the XML, and specify a custom value for valueTag (to avoid a conflict with the value attribute), for example:

df = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "instance") \
        .option("valueTag", "some_value").load("data.xml")

Then you can easily transform the dataframe into the requested format:

>>> from pyspark.sql.functions import col, explode
>>> df.withColumn('values', explode('key')).select(
       col('_name').alias('name'),
       col('_number').alias('number'),
       col('values._value').alias('value')
    ).show()
+----+------+-----+
|name|number|value|
+----+------+-----+
| XYZ|     1|  ABC|
| XYZ|     1| ABCD|
| PQR|     2|ABCDE|
| PQR|     2|ABCDF|
+----+------+-----+

Finally, register this dataframe as a temp table and it will behave as you expected.
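As a sanity check outside Spark (e.g., before the planned pandas analysis), the same attribute flattening can be sketched with the standard library's ElementTree. This is only an illustration of the expected (name, number, value) rows, not part of the spark-xml pipeline:

```python
# Minimal sketch: flatten instance/key attributes into rows,
# mirroring what the explode + select above produces in Spark.
import xml.etree.ElementTree as ET

xml_data = """
<report>
     <instance name="XYZ" number="1">
          <key value="ABC"/>
          <key value="ABCD"/>
     </instance>
     <instance name="PQR" number="2">
          <key value="ABCDE"/>
          <key value="ABCDF"/>
     </instance>
</report>
"""

root = ET.fromstring(xml_data)
# One row per <key> element, repeating the parent <instance> attributes.
rows = [
    (inst.get("name"), int(inst.get("number")), key.get("value"))
    for inst in root.findall("instance")
    for key in inst.findall("key")
]
for row in rows:
    print(row)
# ('XYZ', 1, 'ABC')
# ('XYZ', 1, 'ABCD')
# ('PQR', 2, 'ABCDE')
# ('PQR', 2, 'ABCDF')
```

Such a list of tuples can also be fed directly to pandas.DataFrame(rows, columns=["name", "number", "value"]) for the later statistical analysis.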

Upvotes: 1
