Reputation: 199
I have an XML file that looks as follows:
<report>
<instance name="XYZ" number ="1">
<key value="ABC"/>
<key value="ABCD"/>
</instance>
<instance name="PQR" number ="2">
<key value="ABCDE"/>
<key value="ABCDF"/>
</instance>
</report>
I want to create a SparkSQL table in the following fashion:
name number value
XYZ 1 ABC
XYZ 1 ABCD
PQR 2 ABCDE
PQR 2 ABCDF
Is this technically possible?
I am using PySpark, and this is what I have done so far:
df = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "report").load("//path//to//file.xml")
df.registerTempTable("XMLtable")
I got the schema as:
root
|-- instance: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- @name: string (nullable = true)
| | |-- @number: long (nullable = true)
| | |-- key: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- @value: string (nullable = true)
I performed the following queries:
sqlContext.sql("select * from XMLtable").show()
+-------------+
| INSTANCE|
+-------------+
|[[null,XYZ...|
+-------------+
This query correctly points me to the right data.
sqlContext.sql("select instance[1].key[1]['@value'] as value from XMLtable").show()
+--------+
| value |
+--------+
|(1)ABCDF|
+--------+
So as far as I understand, the table is not loaded as expected. How can I load the table with attributes, as opposed to the general way of loading values?
Can anybody point me in the right direction? I have already checked without attribute values, and that works pretty well. What I need is to create a table as described above with the attribute values populated, which will in turn help in writing queries against the parsed XML data. I also plan to convert this into a pandas DataFrame later to do more statistical analysis of the data.
Upvotes: 1
Views: 4776
Reputation: 13926
Yes, it's possible. First, specify instance instead of report for the rowTag option when loading the XML, and specify a custom value for valueTag (to avoid a conflict with the value attribute), for example:
df = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "instance") \
.option("valueTag", "some_value").load("data.xml")
Then you can easily get the dataframe into the requested format using:
>>> from pyspark.sql.functions import col, explode
>>> df.withColumn('values', explode('key')).select(
...     col('_name').alias('name'),
...     col('_number').alias('number'),
...     col('values._value').alias('value')
... ).show()
+----+------+-----+
|name|number|value|
+----+------+-----+
| XYZ| 1| ABC|
| XYZ| 1| ABCD|
| PQR| 2|ABCDE|
| PQR| 2|ABCDF|
+----+------+-----+
Finally, register this dataframe as a temp table and it will behave as you expected.
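As a quick cross-check outside Spark (and since you mention moving to pandas for analysis later), the same flattening can be sketched with the standard library's xml.etree.ElementTree on your sample data. This is only an illustration of the target shape, not part of the Spark pipeline:

```python
import xml.etree.ElementTree as ET

# Your sample document, inlined for the sake of a self-contained example.
xml_doc = """<report>
  <instance name="XYZ" number="1">
    <key value="ABC"/>
    <key value="ABCD"/>
  </instance>
  <instance name="PQR" number="2">
    <key value="ABCDE"/>
    <key value="ABCDF"/>
  </instance>
</report>"""

root = ET.fromstring(xml_doc)

# One output row per (instance, key) pair: attributes are read with .get().
rows = [
    (inst.get("name"), int(inst.get("number")), key.get("value"))
    for inst in root.findall("instance")
    for key in inst.findall("key")
]

for name, number, value in rows:
    print(name, number, value)
# XYZ 1 ABC
# XYZ 1 ABCD
# PQR 2 ABCDE
# PQR 2 ABCDF
```

The nested list comprehension mirrors what explode('key') does in Spark: it turns each element of the key array into its own row while repeating the parent instance attributes.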
Upvotes: 1