TheRealJimShady
TheRealJimShady

Reputation: 917

GeoTiff raster data to Delta Lake / Parquet format?

Our organisation has been using Databricks recently for ETL and development of datasets. However I have found the libraries/capabilities for raster datasets very limiting. There are a few raster/Spark libraries around, but they are not very mature. For example GeoTrellis, RasterFrames and Apache Sedona.

I have therefore been exploring alternative ways to efficiently work with raster data on the Databricks platform, which leverages Spark / Delta tables / Parquet files.

One idea I had was to dump the raster data to simple x, y, value columns and load them as tables. Providing my other datasets are of the same resolution (I will pre-process them so that they are), I should then be able to do simple SQL queries for masking/addition/subtraction and more complex user-defined functions.

Step one, I thought would be to dump my raster to points as a CSV, and then I can load to a Delta table. But after 12 hours running on my Databricks cluster (128GB memory, 16 cores), a 3GB raster had still not finished (I was using the gdal2xyz function below).

Does anyone have a quicker way to dump a raster to CSV? Or even better, directly to parquet format.

python gdal2xyz.py -band 1 -skipnodata "AR_FLRF_UD_Q1500_RD_02.tif" "AR_FLRF_UD_Q1500_RD_02.csv"

Maybe I can tile the raster, dump each CSV to file using parallel processing, and then bind the CSV files together but it seems a bit laborious.

Upvotes: 2

Views: 1700

Answers (2)

Robert Hijmans
Robert Hijmans

Reputation: 47481

GDAL has a parquet driver since version 3.5. So, with at least that version, you should be able to write raster data to parquet with "terra" like this

library(terra)
x <- vect(system.file("ex/lux.shp", package="terra"))
writeVector(x, "file.pqt", filetype="Parquet")

You can check the version that "terra" uses with terra::gdal().

However, this is not included in the CRAN build for Windows.

Upvotes: 2

Jia Yu - Apache Sedona
Jia Yu - Apache Sedona

Reputation: 304

You can use Sedona to easily load GeoTiffs to DataFrame and save the dataframe as Parquet format. See here: https://sedona.apache.org/latest-snapshot/api/sql/Raster-loader/

Upvotes: 2

Related Questions