Pyspark issue loading xml files with com.databricks:spark-xml

Question

I'm trying to get an academic proof of concept working that relies on PySpark with com.databricks:spark-xml. The goal is to load the Stack Exchange Data Dump XML files (https://archive.org/details/stackexchange) into a PySpark DataFrame.

It works like a charm on correctly formatted XML with proper tags, but fails on the Stack Exchange dump as follows:

Depending on which root tag and row tag I set, I get either an empty schema or something like this:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.format('com.databricks.spark.xml').option("rowTag", "users").load('./tmp/test/Users.xml')
df.printSchema()
df.show()

root
 |-- row: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _AboutMe: string (nullable = true)
 |    |    |-- _AccountId: long (nullable = true)
 |    |    |-- _CreationDate: string (nullable = true)
 |    |    |-- _DisplayName: string (nullable = true)
 |    |    |-- _DownVotes: long (nullable = true)
 |    |    |-- _Id: long (nullable = true)
 |    |    |-- _LastAccessDate: string (nullable = true)
 |    |    |-- _Location: string (nullable = true)
 |    |    |-- _ProfileImageUrl: string (nullable = true)
 |    |    |-- _Reputation: long (nullable = true)
 |    |    |-- _UpVotes: long (nullable = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _Views: long (nullable = true)
 |    |    |-- _WebsiteUrl: string (nullable = true)

+--------------------+
|                 row|
+--------------------+
|[[Hi, I'm not ......|
+--------------------+
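
For context, here is a minimal sketch of the layout the dump files appear to use, inspected with Python's standard library rather than Spark. The two-record sample and the `summarize` helper are hypothetical and made up for illustration; the assumption is that Users.xml is a single `users` root element containing self-closing `row` elements whose data lives entirely in attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-record sample mimicking the assumed layout of Users.xml.
# The attribute values are invented for illustration only.
SAMPLE = """<users>
  <row Id="1" Reputation="101" DisplayName="Alice" />
  <row Id="2" Reputation="1" DisplayName="Bob" />
</users>"""


def summarize(xml_text):
    """Describe the root element and its direct children."""
    root = ET.fromstring(xml_text)
    first = root[0]
    return {
        "root_tag": root.tag,
        # Distinct tag names among the root's direct children.
        "child_tags": sorted({child.tag for child in root}),
        "record_count": len(root),
        # Self-closing records carry their fields as attributes...
        "first_record_attributes": sorted(first.attrib),
        # ...and have no text content of their own.
        "first_record_text": first.text,
    }


summary = summarize(SAMPLE)
for key, value in summary.items():
    print(key, "=", value)
```

If that assumption holds, the schema above is consistent with the whole `users` element being read as one record: the `row` elements end up nested in a single array column, and their attributes show up with a `_` prefix.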

Spark          : 1.6.0
Python         : 2.7.15
com.databricks : spark-xml_2.10:0.4.1

I would be extremely grateful for any advice.

Kind Regards, P.
