Isaac Sullivan

Reputation: 77

Performance of GeoPandas when converting from Pandas WKT

I have a requirement to read approx. 10 million records from a PostGIS database into a GeoPandas dataframe. Reading the data directly from the database takes approx. 15 minutes via the following:

geopandas.GeoDataFrame.from_postgis(sql, engine)

This is acceptable, but I have been trying to improve the read performance by using the PostgreSQL COPY command together with psycopg2's copy_expert function (via the SQLAlchemy engine's raw connection). Reading the data this way into a Pandas dataframe takes approx. 60 seconds, which is a huge improvement:

import tempfile

import pandas

def read_data(engine, sql):
    with tempfile.TemporaryFile() as tmpFile:
        copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
            query=sql, head='HEADER'
        )
        con = engine.raw_connection()
        cur = con.cursor()
        cur.copy_expert(copy_sql, tmpFile)
        tmpFile.seek(0)
        df = pandas.read_csv(tmpFile)
        return df

When trying to do the same, but reading the data into a GeoPandas dataframe, I am running into issues related to the temporary file being used by another process:

import tempfile

import geopandas

def read_data(engine, sql):
    with tempfile.NamedTemporaryFile(suffix='.csv') as tmpFile:
        copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
            query=sql, head='HEADER'
        )
        con = engine.raw_connection()
        cur = con.cursor()
        cur.copy_expert(copy_sql, tmpFile)
        tmpFile.seek(0)
        gdf = geopandas.read_file(tmpFile.name)
        return gdf

This fails with:

fiona.errors.DriverError: C:\Temp\4\tmpiuu6dvl4.csv: file used by other process
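To illustrate the locking behaviour: on Windows, NamedTemporaryFile keeps the file open in this process, so a second open by name (here, by Fiona) is refused; the usual pattern is to create the file with delete=False, close it before reopening, and remove it yourself. A stdlib-only sketch of that pattern, with plain open() standing in for geopandas.read_file:

```python
import os
import tempfile

# Sketch of the close-before-reopen pattern: delete=False lets us close
# our handle so another opener (e.g. geopandas.read_file) can access the
# path on Windows; we then remove the file ourselves.
def write_then_reopen(data: bytes) -> bytes:
    tmp = tempfile.NamedTemporaryFile(suffix='.csv', delete=False)
    try:
        tmp.write(data)
        tmp.close()  # release our handle before reopening by name
        with open(tmp.name, 'rb') as f:
            return f.read()
    finally:
        tmp.close()  # safe to call twice
        os.unlink(tmp.name)
```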

I have tried various ways of releasing the lock on the temporary file without success, so I went back to reading the data into a Pandas dataframe, then converting the geometry column. This works but takes as much time as just reading the data direct from the database into a GeoPandas dataframe:

import tempfile

import geopandas
import pandas

def read_data(engine, sql):
    with tempfile.TemporaryFile() as tmpFile:
        copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
            query=sql, head='HEADER'
        )
        con = engine.raw_connection()
        cur = con.cursor()
        cur.copy_expert(copy_sql, tmpFile)
        tmpFile.seek(0)
        df = pandas.read_csv(tmpFile)
        # This conversion is the slow part:
        df['geom'] = geopandas.GeoSeries.from_wkt(df['geom'])
        return geopandas.GeoDataFrame(df, geometry='geom', crs='EPSG:3857')

The part that is taking a very long time is the conversion from WKT to the GeoSeries:

df['geom'] = geopandas.GeoSeries.from_wkt(df['geom'])

Does anyone know of a solution that either resolves the locked-file issue or speeds up the conversion from WKT to a GeoSeries?

Thanks

Upvotes: 1

Views: 1050

Answers (1)

martinfleis

Reputation: 7814

GeoPandas has to create geometry objects; that is what takes the time. It doesn't matter whether you use GeoDataFrame.from_postgis or your custom code, because even if your read_data worked you would end up with a WKT/WKB representation of the geometries and would have to call from_wkt anyway.

GeoPandas currently depends on Shapely to do the conversion, but it has experimental support for PyGEOS, which will likely be faster. Make sure you have pygeos in your environment and try GeoDataFrame.from_postgis again. That code path is quite well optimized already, so I am not convinced you can get an easy speedup with custom code.

To get pygeos:

# conda
conda install pygeos --channel conda-forge
# pip
pip install pygeos

See https://geopandas.readthedocs.io/en/latest/getting_started/install.html#using-the-optional-pygeos-dependency

Upvotes: 1
