Performance of GeoPandas when converting from Pandas WKT

Question

I have a requirement to read approx. 10 million records from a PostGIS database into a GeoPandas dataframe. Reading the data directly from the database takes approx. 15 minutes via the following:

geopandas.GeoDataFrame.from_postgis(sql, engine)

This is acceptable, but I have been trying to improve the read performance by using the PostgreSQL COPY command along with the SQLAlchemy copy_export function. Reading the data using this method into a Pandas dataframe takes approx. 60 seconds which is a huge improvement:

def read_data(engine, sql):
    with tempfile.TemporaryFile() as tmpFile:
        copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
            query=sql, head='HEADER'
        )
        con = engine.raw_connection()
        cur = con.cursor()
        cur.copy_expert(copy_sql, tmpFile)
        tmpFile.seek(0)
        df = pandas.read_csv(tmpFile)
        return df

When trying to do the same, but reading the data into a GeoPandas dataframe, I am running into issues related to the temporary file being used by another process:

def read_data(engine, sql):
    with tempfile.NamedTemporaryFile(suffix='.csv') as tmpFile:
        copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
            query=sql, head='HEADER'
        )
        con = engine.raw_connection()
        cur = con.cursor()
        cur.copy_expert(copy_sql, tmpFile)
        tmpFile.seek(0)
        gdf = geopandas.read_file(tmpFile.name)
        return gdf

fiona.errors.DriverError: C:\Temp\4	mpiuu6dvl4.csv: file used by other process

I have tried various ways of releasing the lock on the temporary file without success, so I went back to reading the data into a Pandas dataframe, then converting the geometry column. This works but takes as much time as just reading the data direct from the database into a GeoPandas dataframe:

def read_data(engine, sql):
    with tempfile.TemporaryFile() as tmpFile:
        copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
            query=sql, head='HEADER'
        )
        con = engine.raw_connection()
        cur = con.cursor()
        cur.copy_expert(copy_sql, tmpFile)
        tmpFile.seek(0)
        df = pandas.read_csv(tmpFile)
        df['geom'] = geopandas.GeoSeries.from_wkt(df['geom'])
        return geopandas.GeoDataFrame(df, geometry='geom', crs='EPSG:3857')

The part that is taking a very long time is the conversion from WKT to the GeoSeries:

df['geom'] = geopandas.GeoSeries.from_wkt(df['geom'])

Does anyone know of a solution to either solve the locked file issue or to speed up the conversion from WKT to GeoSeries?

Thanks

Performance of GeoPandas when converting from Pandas WKT

Answers (1)

Related Questions