Reevus

Reputation: 69

consume gzip files with databricks autoloader

I am currently unable to find a direct way to load .gz files via Auto Loader. I can load the files as binary content, but I cannot extract the compressed XML files and process them further in a streaming way.

Therefore, I would like to know if there is a way to consume the content of a gzip file via Databricks Auto Loader.
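For reference, this is a minimal sketch of the binary approach mentioned above (the input path is a placeholder):

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "binaryFile")
      .load("/path/to/xml-gz-files"))
# df exposes path, modificationTime, length and content (the raw gzip bytes),
# but there is no built-in way to decompress and parse the XML from here
# in a streaming job.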

Upvotes: 3

Views: 774

Answers (1)

Zac

Reputation: 730

Previously, Databricks users had to load the external package spark-xml to read and write XML data. But it didn't work well with streaming (Auto Loader, for example) or serverless, and it lacked advanced capabilities like schema evolution that are available with other text formats like CSV and JSON.

As of Databricks Runtime 14.1 and above, native XML is supported, and you can also use it with Auto Loader.
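For example, a plain batch read with the native reader looks like this (a minimal sketch; the rowTag value and path are placeholders):

df = (spark.read
      .format("xml")
      .option("rowTag", "row")          # XML element that maps to one output row
      .load("/path/to/files.xml.gz"))   # gzip files are handled transparently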

The XML reader supports gzip, so you don't need to unzip the files; just read them as XML. I tested Auto Loader on S3 reading a gzip-compressed XML file using the following snippet, and it worked as expected.

# Stream gzip-compressed XML files and write them to a Delta table
query = (spark
    .readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "xml")              # native XML (DBR 14.1+)
    .option("rowTag", "FI")                          # element that becomes one row
    .option("cloudFiles.inferColumnTypes", True)
    .option("cloudFiles.schemaLocation", "/Volumes/****/schema")
    .option("cloudFiles.schemaEvolutionMode", "rescue")
    .load("/Volumes/****/data")                      # .xml.gz files are read transparently
    .writeStream
    .format("delta")
    .option("mergeSchema", "true")
    .option("checkpointLocation", "/Volumes/****/checkpoint")
    .trigger(availableNow=True)                      # process available files, then stop
    .start("/Volumes/****/test")
)
query.awaitTermination()

Then you can check the output data:

df = spark.read.format("delta").load("/Volumes/****/test")
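For a quick sanity check you can also inspect the inferred schema, for example:

df.printSchema()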

Note: this feature supports Python.

References:

https://docs.databricks.com/en/_extras/documents/native-xml-private-preview.pdf

https://docs.databricks.com/en/query/formats/xml.html#language-scala

Upvotes: 0
