Reputation: 405
I have created a GeoPandas dataframe with 50 million records containing Latitude/Longitude in CRS 3857, and I want to convert them to 4326. Since the dataset is huge, GeoPandas is unable to do the conversion. How can I execute this in a distributed manner?
import geopandas as gpd
from shapely.geometry import Point

def convert_crs(sdf):
    df = sdf.toPandas()
    gdf = gpd.GeoDataFrame(
        df.drop(['Longitude', 'Latitude'], axis=1),
        crs={'init': 'epsg:4326'},
        geometry=[Point(xy) for xy in zip(df.Longitude, df.Latitude)])
    return gdf

result_gdf = convert_crs(grid_df)
Upvotes: 3
Views: 4917
Reputation: 721
See: https://github.com/geopandas/geopandas/issues/1400
This is very fast and memory efficient:
from pyproj import Transformer

# Build one reusable transformer; always_xy=True means (lon, lat) axis order
trans = Transformer.from_crs(
    "EPSG:4326",
    "EPSG:3857",
    always_xy=True,
)
# Vectorized transform over the whole columns at once
xx, yy = trans.transform(df["Longitude"].values, df["Latitude"].values)
df["X"] = xx
df["Y"] = yy
Upvotes: 3
Reputation: 791
I hope this answer is fair enough, because it will effectively solve your problem for any size of dataset, and it's a well-trodden approach to dealing with data that's too big for memory.
Answer: Store your data in PostGIS
You would then have two options for doing what you want.
In my experience it's solid, reliable and pretty well integrated with GeoPandas via GeoAlchemy2.
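For illustration, a minimal sketch of one way this can look from GeoPandas; it assumes a hypothetical PostGIS table grid with a geometry column geom stored in EPSG:3857 and a placeholder connection string:

import geopandas as gpd
from sqlalchemy import create_engine

# Hypothetical connection string and table/column names
engine = create_engine("postgresql://user:password@localhost:5432/mydb")

# Let PostGIS reproject server-side, then load the result into GeoPandas
sql = "SELECT id, ST_Transform(geom, 4326) AS geom FROM grid"
gdf = gpd.read_postgis(sql, engine, geom_col="geom")

# Alternatively, pull the table in chunks and reproject each piece in Python
for chunk in gpd.read_postgis("SELECT * FROM grid", engine,
                              geom_col="geom", chunksize=1_000_000):
    part = chunk.to_crs(epsg=4326)
    # write each reprojected chunk back out, or accumulate as needed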
Upvotes: 1
Reputation: 15452
See the geopandas docs on installation and make sure you have the latest version of geopandas and PyGEOS installed. From the installation docs:
Work is ongoing to improve the performance of GeoPandas. Currently, the fast implementations of basic spatial operations live in the PyGEOS package (but work is under way to contribute those improvements to Shapely). Starting with GeoPandas 0.8, it is possible to optionally use those experimental speedups by installing PyGEOS.
Note the caveat that to_crs will ignore & drop any z coordinate information, so if this is important you unfortunately cannot use these speedups and something like dask_geopandas may be required.
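If you want to confirm the PyGEOS-backed speedups are actually active, GeoPandas (0.8 through 0.10) exposes an option for it; a small sketch, not part of the original answer:

import geopandas as gpd

# With PyGEOS installed, the vectorized backend is normally used automatically;
# the option can be set or read to confirm it is active.
gpd.options.use_pygeos = True
print(gpd.options.use_pygeos)   # True -> fast PyGEOS-backed geometry ops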
However, with a recent version of geopandas and PyGEOS installed, converting the CRS of 50 million points should be possible. The following generates 50m random points (<1s), creates a GeoDataFrame with geometries from the points in WGS84 (18s), converts them all to Web Mercator (1m21s), and then converts them back to WGS84 (54s):
In [1]: import geopandas as gpd, pandas as pd, numpy as np
In [2]: %%time
...: n = int(50e6)
...: lats = np.random.random(size=n) * 180 - 90
...: lons = np.random.random(size=n) * 360 - 180
...:
...:
CPU times: user 613 ms, sys: 161 ms, total: 774 ms
Wall time: 785 ms
In [3]: %%time
...: df = gpd.GeoDataFrame(geometry=gpd.points_from_xy(lons, lats, crs="epsg:4326"))
...:
...:
CPU times: user 11.7 s, sys: 4.66 s, total: 16.4 s
Wall time: 17.8 s
In [4]: %%time
...: df_mercator = df.to_crs("epsg:3857")
...:
...:
CPU times: user 1min 1s, sys: 13.7 s, total: 1min 15s
Wall time: 1min 21s
In [5]: %%time
...: df_wgs84 = df_mercator.to_crs("epsg:4326")
...:
...:
CPU times: user 39.4 s, sys: 9.59 s, total: 49 s
Wall time: 54 s
I ran this on a 2021 Apple M1 Max chip with 32 GB of memory using GeoPandas v0.10.2 and PyGEOS v0.12.0. Real memory usage peaked at around 9 GB, so it's possible your computer is hitting memory limits, or the runtime may be the issue; if so, additional debugging details and the full workflow would definitely be helpful. Still, this seems like a workflow that should be doable on most computers: you may need to partition the data and work through it in chunks if memory is tight, but 50 million points is within a single order of magnitude of what most machines should be able to handle.
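As a rough illustration of the partitioned route (not part of the original answer; it assumes the dask_geopandas package mentioned above and its from_geopandas/to_crs API):

import dask_geopandas

# Assumption: df is the 50M-point GeoDataFrame built in the snippet above
ddf = dask_geopandas.from_geopandas(df, npartitions=16)

# to_crs is applied lazily per partition; compute() materializes the result
df_wgs84 = ddf.to_crs("epsg:4326").compute()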
Upvotes: 1