Reputation: 87
I have a large (O(10^6) rows) dataset of points with values, where for every point I need to find all other points within a cutoff distance and average the values of the three nearest of those.
The "non-vectorised" approach would be to simply loop over all points, and for each point loop over all the other points and apply the logic. That is O(n^2), however, and scales poorly.
Here's a toy example of the logic I want to implement:
import pandas as pd
import numpy as np
from shapely.wkt import loads
import geopandas as gp

points = [
    'POINT (1 1.1)', 'POINT (1 1.9)', 'POINT (1 3.1)',
    'POINT (2 1)', 'POINT (2 2.1)', 'POINT (2 2.9)',
    'POINT (3 0.8)', 'POINT (3 2)', 'POINT (3 3)'
]
values = [9, 8, 7, 6, 5, 4, 3, 2, 1]

df = pd.DataFrame({'points': points, 'values': values})
gdf = gp.GeoDataFrame(df, geometry=[loads(x) for x in df.points], crs='EPSG:25832')
for index, row in gdf.iterrows():  # Loop over all points
    gdf['dist'] = np.nan  # Reset the distance column for this iteration
    for index2, row2 in gdf.iterrows():  # Loop over all the other points
        if index == index2:
            continue
        d = row['geometry'].distance(row2['geometry'])  # Calculate distance
        if d < 3:
            gdf.at[index2, 'dist'] = d  # Within cutoff: store
        else:
            gdf.at[index2, 'dist'] = np.nan  # Otherwise, be paranoid and leave NaN
    # Mean of the values of the 3 nearest points (NaN distances sort last)
    gdf.at[index, 'mean'] = gdf.sort_values('dist').head(3)['values'].mean()
print(gdf)
The resulting GeoDataFrame is:
points values geometry dist mean
0 POINT (1 1.1) 9 POINT (1 1.1) 2.758623 6.333333
1 POINT (1 1.9) 8 POINT (1 1.9) 2.282542 7.000000
2 POINT (1 3.1) 7 POINT (1 3.1) 2.002498 5.666667
3 POINT (2 1) 6 POINT (2 1) 2.236068 5.666667
4 POINT (2 2.1) 5 POINT (2 2.1) 1.345362 4.666667
5 POINT (2 2.9) 4 POINT (2 2.9) 1.004988 4.333333
6 POINT (3 0.8) 3 POINT (3 0.8) 2.200000 4.333333
7 POINT (3 2) 2 POINT (3 2) 1.000000 3.000000
8 POINT (3 3) 1 POINT (3 3) NaN 3.666667
(The dist column shows its state from the last iteration of the outer loop.)
How do I do this in a more scalable manner?
Upvotes: 2
Views: 19672
Reputation: 7804
I would use a spatial index for that. You can use libpysal, which uses a KDTree under the hood. For 2000 random points, the following code runs in 3.5 s, compared to yours, which runs for ages (I lost patience after the first minute). Saving the results to a list and then assigning the list to a DataFrame column saves you some time as well.
import pandas as pd
import numpy as np
from shapely.wkt import loads
import geopandas as gp
import libpysal

points = [
    'POINT (1 1.1)', 'POINT (1 1.9)', 'POINT (1 3.1)',
    'POINT (2 1)', 'POINT (2 2.1)', 'POINT (2 2.9)',
    'POINT (3 0.8)', 'POINT (3 2)', 'POINT (3 3)'
]
values = [9, 8, 7, 6, 5, 4, 3, 2, 1]

df = pd.DataFrame({'points': points, 'values': values})
gdf = gp.GeoDataFrame(df, geometry=[loads(x) for x in df.points], crs='EPSG:25832')
# Build the KNN spatial index: the 3 nearest neighbours of each point
knn3 = libpysal.weights.KNN.from_dataframe(gdf, k=3)

means = []
for index, row in gdf.iterrows():  # Loop over all points
    knn_neighbors = knn3.neighbors[index]  # positions of the 3 nearest points
    knnsubset = gdf.iloc[knn_neighbors]
    neighbors = []
    for ix, r in knnsubset.iterrows():
        if r.geometry.distance(row.geometry) < 3:  # max distance (cutoff) here
            neighbors.append(ix)
    subset = gdf.loc[neighbors]  # .loc, since iterrows yields index labels
    means.append(np.mean(subset['values']))
gdf['mean'] = means
This is the result:
points values geometry mean
0 POINT (1 1.1) 9 POINT (1 1.1) 6.333333
1 POINT (1 1.9) 8 POINT (1 1.9) 7.000000
2 POINT (1 3.1) 7 POINT (1 3.1) 5.666667
3 POINT (2 1) 6 POINT (2 1) 5.666667
4 POINT (2 2.1) 5 POINT (2 2.1) 4.666667
5 POINT (2 2.9) 4 POINT (2 2.9) 4.333333
6 POINT (3 0.8) 3 POINT (3 0.8) 4.333333
7 POINT (3 2) 2 POINT (3 2) 3.000000
8 POINT (3 3) 1 POINT (3 3) 3.666667
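As a side note, since libpysal's KNN is a KDTree under the hood, you can also query scipy's cKDTree directly and vectorise the whole thing, with no Python loop at all. This is just a sketch under the same assumptions as above (projected CRS, cutoff of 3, k = 3, no duplicate points):

from scipy.spatial import cKDTree

# Build the tree on the raw coordinates
coords = np.array([(geom.x, geom.y) for geom in gdf.geometry])
tree = cKDTree(coords)

# Query 4 neighbours per point (the first hit is the point itself)
# and apply the cutoff inside the query
dist, idx = tree.query(coords, k=4, distance_upper_bound=3)
dist, idx = dist[:, 1:], idx[:, 1:]  # drop the self-match

# Neighbours beyond the cutoff come back with idx == len(coords),
# so append a NaN sentinel and average only the valid entries
# (a point with no neighbour within the cutoff at all yields NaN)
vals = np.append(gdf['values'].to_numpy(dtype=float), np.nan)
gdf['mean'] = np.nanmean(vals[idx], axis=1)

For the toy data this reproduces the means in the table above.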
Upvotes: 9