jpinelo
jpinelo

Reputation: 1454

Spatial database architecture with Apache Parquet, PostgresSQL and PostGIS on on-premises bare-metal S3/MinIo cluster

Designing storage architecture for Petabyte-scale geospatial data; starting from scratch. Creating a MinIo cluster to store the objects in S3 buckets. To store the metadata, I’m considering the Apache Parquet format managed by PostgresSQL, extended by PostGIS. Using Parquet implies resourcing to a PostgresSQL FDW (foreign data wrapper) - ParquetS3. No doubts regarding Parquet’s main advantages when compared to "pure" PostgresSQL: higher query speed, higher compression rate, and the ability to store in S3, which for us is a significant advantage. However, I cannot find documentation or a use case of Parquet with geospatial data.

Should I worry about losing geospatial query features (the PostGIS support) due to the use of Parquet or it’s wrapper for PostgreSQL - ParquetS3?

According to PostgreSQL’s documentation, the wrapper ParquetS3 is valid for PostgreSQL, but I see no mention to the extension PostGIS. On this same PostgreSQL FDW list, there are some geo data wrappers, but these aren’t meant to use use S3/MinIo. According to PostGIS documentation, PostGIS ships with two FWD (Oracle FDW and OGR FDW), but this is a post from 2014.

The code repository for OGR FDW reads:

OGR is the vector half of the GDAL spatial data access library. It allows access to a large number of GIS data formats using a simple C API for data reading and writing. Since OGR exposes a simple table structure and PostgreSQL foreign data wrappers allow access to table structures, the fit seems pretty perfect.

This is followed by a list of limitations of the implementation.

Does the OGR FDW work below the ParquetS3 one, despite the limitations listed? Does anyone know of a FDW for geo data (PostGIS), that also allows sourcing from MinIo S3 buckets? With Parquet or equivalent? Or has anyone tested the configuration above?

I always have the alternative of creating the data store of the PostgreSQL/PostGIS outside of S3, but I’d rather have it in the MinIo cluster.

Thanks.

Upvotes: 1

Views: 985

Answers (1)

RaczeQ
RaczeQ

Reputation: 136

As for Parquet, there is a special format for geospatial data called geoparquet created by Open Geospatial Consortium.

Based on this answer, geoparquet embeds it's geo data as WKB so it should be straightforward to use it in PostGIS. However official documentation currently mentions integration only with geopandas python library, sfarrow R library and GDAL/OGR tools.

Worst-case scenario - you'd have to write a python script for digesting data from S3 and uploading it into PostGIS in some form of ETL (maybe with Apache Airflow).

Upvotes: 0

Related Questions