Reputation: 77
I have a requirement to read approx. 10 million records from a PostGIS database into a GeoPandas dataframe. Reading the data directly from the database takes approx. 15 minutes via the following:
geopandas.GeoDataFrame.from_postgis(sql, engine)
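For context, the full call looks roughly like this (the connection string, table name and geometry column here are placeholders):

import geopandas
from sqlalchemy import create_engine

# Placeholder connection details and query for illustration
engine = create_engine('postgresql://user:password@host:5432/mydb')
sql = 'SELECT id, geom FROM my_table'

gdf = geopandas.GeoDataFrame.from_postgis(sql, engine, geom_col='geom')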
This is acceptable, but I have been trying to improve the read performance by using the PostgreSQL COPY command together with psycopg2's copy_expert function (via SQLAlchemy's raw connection). Reading the data this way into a Pandas dataframe takes approx. 60 seconds, which is a huge improvement:
def read_data(engine, sql):
    with tempfile.TemporaryFile() as tmpFile:
        copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
            query=sql, head='HEADER'
        )
        con = engine.raw_connection()
        cur = con.cursor()
        cur.copy_expert(copy_sql, tmpFile)
        tmpFile.seek(0)
        df = pandas.read_csv(tmpFile)
        return df
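For reference, I call it along these lines (the engine and query here are placeholders):

import tempfile

import pandas
from sqlalchemy import create_engine

# Placeholder connection string and query; ST_AsText is assumed so the
# geometry comes back as WKT for the later conversion step
engine = create_engine('postgresql://user:password@host:5432/mydb')
sql = 'SELECT id, ST_AsText(geom) AS geom FROM my_table'
df = read_data(engine, sql)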
When trying to do the same but reading the data into a GeoPandas dataframe instead, I run into issues with the temporary file still being held open by another process:
def read_data(engine, sql):
    with tempfile.NamedTemporaryFile(suffix='.csv') as tmpFile:
        copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
            query=sql, head='HEADER'
        )
        con = engine.raw_connection()
        cur = con.cursor()
        cur.copy_expert(copy_sql, tmpFile)
        tmpFile.seek(0)
        gdf = geopandas.read_file(tmpFile.name)
        return gdf
fiona.errors.DriverError: C:\Temp\4\tmpiuu6dvl4.csv: file used by other process
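One of the variants I tried creates the file with delete=False so my handle can be closed before GeoPandas reopens the path by name (a sketch; the cleanup handling is an assumption):

import os
import tempfile
import geopandas

def read_data(engine, sql):
    # delete=False so we can close our handle before read_file reopens the path
    tmpFile = tempfile.NamedTemporaryFile(suffix='.csv', delete=False)
    try:
        copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
            query=sql, head='HEADER'
        )
        con = engine.raw_connection()
        cur = con.cursor()
        cur.copy_expert(copy_sql, tmpFile)
        tmpFile.close()  # release the Windows lock before reopening by name
        return geopandas.read_file(tmpFile.name)
    finally:
        if not tmpFile.closed:
            tmpFile.close()
        os.remove(tmpFile.name)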
I have tried various ways of releasing the lock on the temporary file (like the sketch above) without success, so I went back to reading the data into a Pandas dataframe and then converting the geometry column. This works, but takes as much time as reading the data directly from the database into a GeoPandas dataframe:
def read_data(engine, sql):
    with tempfile.TemporaryFile() as tmpFile:
        copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
            query=sql, head='HEADER'
        )
        con = engine.raw_connection()
        cur = con.cursor()
        cur.copy_expert(copy_sql, tmpFile)
        tmpFile.seek(0)
        df = pandas.read_csv(tmpFile)
        df['geom'] = geopandas.GeoSeries.from_wkt(df['geom'])
        return geopandas.GeoDataFrame(df, geometry='geom', crs='EPSG:3857')
The part that is taking a very long time is the conversion from WKT to the GeoSeries:
df['geom'] = geopandas.GeoSeries.from_wkt(df['geom'])
Does anyone know of a solution to either solve the locked file issue or to speed up the conversion from WKT to GeoSeries?
Thanks
Upvotes: 1
Views: 1050
Reputation: 7814
GeoPandas has to create geometry objects, and that is what takes the time. It doesn't matter whether you use GeoDataFrame.from_postgis or your custom code, because even if your read_data worked you would end up with a WKT/WKB representation of the geometries and have to call from_wkt anyway.
GeoPandas currently depends on shapely to do the conversion, but it has experimental support for pygeos, which will likely be faster. Make sure you have pygeos in your environment and try GeoDataFrame.from_postgis again. That code is quite well optimized already, so I am not convinced you can get an easy speedup with custom code.
To get pygeos:
# conda
conda install pygeos --channel conda-forge
# pip
pip install pygeos
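Once pygeos is installed, recent GeoPandas releases should pick it up automatically; you can also opt in explicitly (a minimal sketch; the option requires GeoPandas 0.8+):

import geopandas

# Opt in to the experimental pygeos-backed geometry arrays (GeoPandas 0.8+);
# set this before creating any geometries
geopandas.options.use_pygeos = True

# Then retry the original call from the question
gdf = geopandas.GeoDataFrame.from_postgis(sql, engine, geom_col='geom')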
Upvotes: 1