Reputation: 69
I am currently unable to find a direct way to load .gz files via Auto Loader. I can load the files as binary content, but I cannot extract the compressed XML files and process them further in a streaming way.
Therefore, I would like to know if there is a way to consume the content of a gzip file via Databricks Auto Loader.
Upvotes: 3
Views: 774
Reputation: 730
Previously, Databricks users had to load the external package spark-xml to read and write XML data. It did not work well with streaming (Auto Loader, for example) or serverless, and it lacked advanced capabilities such as schema evolution that are available for other text formats like CSV and JSON.
Native XML is supported in Databricks Runtime 14.1 and above, and you can use it with Auto Loader.
It also handles gzip transparently: you do not need to unzip the files, just read them as XML. I tested Auto Loader on S3 against a gzipped XML file using the following snippet, and it worked as expected.
query = (
    spark.readStream
    .format("cloudFiles")                               # Auto Loader
    .option("cloudFiles.format", "xml")                 # native XML reader (DBR 14.1+)
    .option("rowTag", "FI")                             # element that marks one row
    .option("cloudFiles.inferColumnTypes", True)
    .option("cloudFiles.schemaLocation", "/Volumes/****/schema")
    .option("cloudFiles.schemaEvolutionMode", "rescue")
    .load("/Volumes/****/data")                         # .xml.gz files are read directly
    .writeStream
    .format("delta")
    .option("mergeSchema", "true")
    .option("checkpointLocation", "/Volumes/****/checkpoint")
    .trigger(availableNow=True)
    .start("/Volumes/****/test")
)
query.awaitTermination()
Then you can check the output data:
df = spark.read.format("delta").load("/Volumes/****/test")
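As an aside, the reason no manual unzipping is needed is that gzip is a transparent, whole-file compression wrapper: decompressing the stream yields the original XML bytes unchanged, so any XML reader can consume it. A minimal local illustration in plain Python (no Spark; the file name and the FI row element are made up for the example):

```python
import gzip
import xml.etree.ElementTree as ET

# Write a tiny gzipped XML file (hypothetical structure with <FI> row elements).
xml_bytes = b"<root><FI><id>1</id></FI><FI><id>2</id></FI></root>"
with gzip.open("sample.xml.gz", "wb") as f:
    f.write(xml_bytes)

# Reading through gzip.open yields the original XML stream, which an XML
# parser consumes directly -- the same transparency Auto Loader relies on.
with gzip.open("sample.xml.gz", "rb") as f:
    root = ET.parse(f).getroot()

rows = [fi.findtext("id") for fi in root.iter("FI")]
print(rows)  # ['1', '2']
```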
Note: This feature supports Python
References:
https://docs.databricks.com/en/_extras/documents/native-xml-private-preview.pdf
https://docs.databricks.com/en/query/formats/xml.html#language-scala
Upvotes: 0