code_bug

Reputation: 405

Geopandas convert crs

I have created a geopandas dataframe with 50 million records containing Latitude and Longitude in CRS 3857, and I want to convert them to 4326. Since the dataset is huge, geopandas is unable to convert it. How can I execute this in a distributed manner?

    import geopandas as gpd
    from shapely.geometry import Point

    def convert_crs(sdf):
        # Collect the Spark dataframe to the driver, then build point geometries
        df = sdf.toPandas()
        gdf = gpd.GeoDataFrame(
            df.drop(['Longitude', 'Latitude'], axis=1),
            crs={'init': 'epsg:4326'},
            geometry=[Point(xy) for xy in zip(df.Longitude, df.Latitude)])
        return gdf

    result_gdf = convert_crs(grid_df)

Upvotes: 3

Views: 4917

Answers (3)

snowman2

Reputation: 721

See: https://github.com/geopandas/geopandas/issues/1400

This is very fast and memory efficient:

    from pyproj import Transformer

    # Build the transformer once and reuse it; always_xy=True keeps the
    # (x, y) / (longitude, latitude) axis order. Swap the source and target
    # CRS to match the direction you need.
    trans = Transformer.from_crs(
        "EPSG:4326",
        "EPSG:3857",
        always_xy=True,
    )
    xx, yy = trans.transform(df["Longitude"].values, df["Latitude"].values)
    df["X"] = xx
    df["Y"] = yy

Upvotes: 3

DavidH

Reputation: 791

I hope this answer is fair enough, because it will effectively solve your problem for any size of dataset, and it's a well-trodden approach to data that is too big for memory.

Answer: Store your data in PostGIS

You would then have two options for doing what you want.

  1. Do data manipulations in PostGIS, using its geo-spatial SQL syntax. The database will do the memory management for you.
  2. Retrieve data a chunk at a time, do the manipulation in GeoPandas and rewrite to the database.

In my experience it's solid, reliable and pretty well integrated with GeoPandas via GeoAlchemy2.
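
A minimal sketch of option 1, assuming a hypothetical table named grid with a geometry column geom stored in EPSG:3857 and a placeholder connection string; adjust the names to your schema:

    import geopandas as gpd
    from sqlalchemy import create_engine

    # Connection string is a placeholder; point it at your PostGIS database.
    engine = create_engine("postgresql://user:password@localhost:5432/geodb")

    # Let PostGIS reproject server-side; only the converted rows come back.
    sql = "SELECT id, ST_Transform(geom, 4326) AS geom FROM grid"
    gdf = gpd.read_postgis(sql, con=engine, geom_col="geom")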

Upvotes: 1

Michael Delgado

Reputation: 15452

See the geopandas docs on installation and make sure you have the latest versions of geopandas and PyGEOS installed. From the installation docs:

Work is ongoing to improve the performance of GeoPandas. Currently, the fast implementations of basic spatial operations live in the PyGEOS package (but work is under way to contribute those improvements to Shapely). Starting with GeoPandas 0.8, it is possible to optionally use those experimental speedups by installing PyGEOS.
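
A quick way to confirm which geometry engine is active (a sketch assuming GeoPandas 0.8-0.10, where the option below exists):

    import geopandas as gpd

    # Prints the versions of geopandas, shapely, pygeos, pyproj, etc.
    gpd.show_versions()

    # True when the PyGEOS-backed vectorized geometry engine is in use.
    print(gpd.options.use_pygeos)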

Note the caveat that to_crs will ignore & drop any z coordinate information, so if this is important you unfortunately cannot use these speedups and something like dask_geopandas may be required.

However, with a recent version of geopandas and PyGEOS installed, converting the CRS of 50 million points should be possible. The following generates 50 million random points (<1s), creates a GeoDataFrame with geometries from the points in WGS84 (18s), converts them all to web mercator (1m21s), and then converts them back to WGS84 (54s):

In [1]: import geopandas as gpd, pandas as pd, numpy as np

In [2]: %%time
   ...: n = int(50e6)
   ...: lats = np.random.random(size=n) * 180 - 90
   ...: lons = np.random.random(size=n) * 360 - 180
   ...:
   ...:
CPU times: user 613 ms, sys: 161 ms, total: 774 ms
Wall time: 785 ms

In [3]: %%time
   ...: df = gpd.GeoDataFrame(geometry=gpd.points_from_xy(lons, lats, crs="epsg:4326"))
   ...:
   ...:
CPU times: user 11.7 s, sys: 4.66 s, total: 16.4 s
Wall time: 17.8 s

In [4]: %%time
   ...: df_mercator = df.to_crs("epsg:3857")
   ...:
   ...:
CPU times: user 1min 1s, sys: 13.7 s, total: 1min 15s
Wall time: 1min 21s

In [5]: %%time
   ...: df_wgs84 = df_mercator.to_crs("epsg:4326")
   ...:
   ...:
CPU times: user 39.4 s, sys: 9.59 s, total: 49 s
Wall time: 54 s

I ran this on a 2021 Apple M1 Max chip with 32 GB of memory using GeoPandas v0.10.2 and PyGEOS v0.12.0. Real memory usage peaked at around 9 GB, so it's possible your computer is facing memory constraints, or the runtime may be the issue. If so, additional debugging details and the full workflow would definitely be helpful! But this seems like a workflow that should be doable on most computers. You may need to partition the data and work through it in chunks if you're facing memory constraints, but it's within a single order of magnitude of what most computers should be able to handle.
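
If you do need to partition the work (or keep z coordinates), a minimal dask_geopandas sketch might look like the following; the partition count is an assumption, and gdf stands in for your GeoDataFrame:

    import dask_geopandas

    # Split the GeoDataFrame into partitions that are reprojected in parallel.
    dgdf = dask_geopandas.from_geopandas(gdf, npartitions=16)
    result = dgdf.to_crs("epsg:4326").compute()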

Upvotes: 1
