Reputation: 198
I'm trying to get an academic proof of concept working that relies on PySpark with com.databricks:spark-xml. The goal is to load the Stack Exchange Data Dump XML files (https://archive.org/details/stackexchange) into a PySpark DataFrame.
It works like a charm with XML that keeps its data in properly nested tags, but it fails with the Stack Exchange dump, where each record is a self-closing row element with all values stored as attributes:
<users>
<row Id="-1" Reputation="1" CreationDate="2014-07-30T18:05:25.020" DisplayName="Community" LastAccessDate="2014-07-30T18:05:25.020" Location="on the server farm" AboutMe=" I feel pretty, Oh, so pretty" Views="0" UpVotes="26" DownVotes="701" AccountId="-1" />
</users>
Depending on which root tag and row tag I choose, I get either an empty schema or something like this:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.xml').option("rowTag", "users").load('./tmp/test/Users.xml')
df.printSchema()
df.show()
root
 |-- row: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _AboutMe: string (nullable = true)
 |    |    |-- _AccountId: long (nullable = true)
 |    |    |-- _CreationDate: string (nullable = true)
 |    |    |-- _DisplayName: string (nullable = true)
 |    |    |-- _DownVotes: long (nullable = true)
 |    |    |-- _Id: long (nullable = true)
 |    |    |-- _LastAccessDate: string (nullable = true)
 |    |    |-- _Location: string (nullable = true)
 |    |    |-- _ProfileImageUrl: string (nullable = true)
 |    |    |-- _Reputation: long (nullable = true)
 |    |    |-- _UpVotes: long (nullable = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _Views: long (nullable = true)
 |    |    |-- _WebsiteUrl: string (nullable = true)
+--------------------+
|                 row|
+--------------------+
|[[Hi, I'm not ......|
+--------------------+
Versions: Spark 1.6.0, Python 2.7.15, com.databricks:spark-xml_2.10:0.4.1
I would be extremely grateful for any advice.
Kind Regards, P.
Upvotes: 0
Views: 2817
Reputation: 13926
I tried the same approach (spark-xml on Stack Overflow dump files) some time ago and failed, mostly because the DataFrame comes out as an array of structs and the processing performance was really bad. Instead, I recommend using the standard text reader and mapping the Key="Value" pairs in every line with a UDF like this:
import re
# A space, an attribute name, then its double-quoted value
pattern = re.compile(r' ([A-Za-z]+)="([^"]*)"')
parse_line = lambda line: {key: value for key, value in pattern.findall(line)}
You can also use my code to get the proper data types: https://github.com/szczeles/pyspark-notebooks/blob/master/stackoverflow/stackexchange-convert.ipynb (the schema matches dumps for March 2017).
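To make the regex approach concrete, here is a minimal sketch that runs without Spark and applies the same pattern to a shortened row from the question. The names `sample` and `parsed`, the `filter` step, and the commented Spark lines are my own assumptions for illustration, not part of the linked notebook.

```python
import re

# Same pattern as above: a space, an attribute name, then a double-quoted value.
pattern = re.compile(r' ([A-Za-z]+)="([^"]*)"')

def parse_line(line):
    """Return the attributes of one <row ... /> line as a dict of strings."""
    return {key: value for key, value in pattern.findall(line)}

# Shortened version of the row shown in the question.
sample = '<row Id="-1" Reputation="1" DisplayName="Community" Views="0" />'
parsed = parse_line(sample)
print(parsed)

# Lines without attributes, such as the <users> wrapper, give an empty dict.
print(parse_line('<users>'))

# Hypothetical Spark usage, assuming an existing SparkContext `sc`
# and the file path from the question:
# rows = (sc.textFile('./tmp/test/Users.xml')
#           .map(parse_line)
#           .filter(lambda d: len(d) > 0))
```

Every value comes back as a string, and XML entities such as `&lt;` stay escaped, so type conversion and unescaping still have to happen in a later step.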
Upvotes: 1