Reputation: 87
I have a large (O(10^6) rows) dataset of points with values, where for every point I need to find all other points within a cutoff distance and average the values of the three nearest of those.
The "non-vectorised" approach would be to simply loop over all points, and for each point loop over all the other points and apply the logic. That is O(n^2), however, and scales poorly.
Here's a toy example of the logic I want to implement:
import pandas as pd
import numpy as np
from shapely.wkt import loads
import geopandas as gp

points = [
    'POINT (1 1.1)', 'POINT (1 1.9)', 'POINT (1 3.1)',
    'POINT (2 1)', 'POINT (2 2.1)', 'POINT (2 2.9)',
    'POINT (3 0.8)', 'POINT (3 2)', 'POINT (3 3)'
]
values = [9, 8, 7, 6, 5, 4, 3, 2, 1]

df = pd.DataFrame({'points': points, 'values': values})
gdf = gp.GeoDataFrame(df, geometry=[loads(x) for x in df.points], crs='EPSG:25832')
for index, row in gdf.iterrows():  # Loop over all points
    gdf['dist'] = np.nan  # Reset the distance column for this iteration
    for index2, row2 in gdf.iterrows():  # Loop over all the other points
        if index == index2:
            continue
        d = row['geometry'].distance(row2['geometry'])  # Calculate distance
        if d < 3:
            gdf.at[index2, 'dist'] = d  # Within cutoff: store
        else:
            gdf.at[index2, 'dist'] = np.nan  # Otherwise, be paranoid and leave NaN
    # Mean of the values of the 3 nearest points (NaN distances sort last)
    gdf.at[index, 'mean'] = gdf.sort_values('dist').head(3)['values'].mean()
print(gdf)
The resulting GeoDataFrame is:
points values geometry dist mean
0 POINT (1 1.1) 9 POINT (1 1.1) 2.758623 6.333333
1 POINT (1 1.9) 8 POINT (1 1.9) 2.282542 7.000000
2 POINT (1 3.1) 7 POINT (1 3.1) 2.002498 5.666667
3 POINT (2 1) 6 POINT (2 1) 2.236068 5.666667
4 POINT (2 2.1) 5 POINT (2 2.1) 1.345362 4.666667
5 POINT (2 2.9) 4 POINT (2 2.9) 1.004988 4.333333
6 POINT (3 0.8) 3 POINT (3 0.8) 2.200000 4.333333
7 POINT (3 2) 2 POINT (3 2) 1.000000 3.000000
8 POINT (3 3) 1 POINT (3 3) NaN 3.666667
(The dist column shows its state from the last iteration of the outer loop.)
How do I do this in a more scalable manner?
Upvotes: 2
Views: 19672
Reputation: 7804
I would use a spatial index for that. You can use libpysal, which uses a KDTree under the hood. For 2000 random points, the following code runs in 3.5 s, compared to yours, which runs for ages (I lost patience after the first minute). Saving the results to a list and then assigning the list to a DataFrame column saves you some time as well.
import pandas as pd
import numpy as np
from shapely.wkt import loads
import geopandas as gp
import libpysal

points = [
    'POINT (1 1.1)', 'POINT (1 1.9)', 'POINT (1 3.1)',
    'POINT (2 1)', 'POINT (2 2.1)', 'POINT (2 2.9)',
    'POINT (3 0.8)', 'POINT (3 2)', 'POINT (3 3)'
]
values = [9, 8, 7, 6, 5, 4, 3, 2, 1]

df = pd.DataFrame({'points': points, 'values': values})
gdf = gp.GeoDataFrame(df, geometry=[loads(x) for x in df.points], crs='EPSG:25832')
# Build the KNN spatial index: the 3 nearest neighbours of each point
knn3 = libpysal.weights.KNN.from_dataframe(gdf, k=3)

means = []
for index, row in gdf.iterrows():  # Loop over all points
    knn_neighbors = knn3.neighbors[index]  # positions of the 3 nearest points
    knnsubset = gdf.iloc[knn_neighbors]
    neighbors = []
    for ix, r in knnsubset.iterrows():
        if r.geometry.distance(row.geometry) < 3:  # max distance (cutoff) here
            neighbors.append(ix)
    subset = gdf.loc[neighbors]  # .loc, since iterrows yields index labels
    means.append(np.mean(subset['values']))
gdf['mean'] = means
This is the result:
points values geometry mean
0 POINT (1 1.1) 9 POINT (1 1.1) 6.333333
1 POINT (1 1.9) 8 POINT (1 1.9) 7.000000
2 POINT (1 3.1) 7 POINT (1 3.1) 5.666667
3 POINT (2 1) 6 POINT (2 1) 5.666667
4 POINT (2 2.1) 5 POINT (2 2.1) 4.666667
5 POINT (2 2.9) 4 POINT (2 2.9) 4.333333
6 POINT (3 0.8) 3 POINT (3 0.8) 4.333333
7 POINT (3 2) 2 POINT (3 2) 3.000000
8 POINT (3 3) 1 POINT (3 3) 3.666667
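As a side note, since libpysal's KNN is a KDTree under the hood, you can also query scipy's cKDTree directly and vectorise the whole thing, with no Python loop at all. This is just a sketch under the same assumptions as above (projected CRS, cutoff of 3, k = 3, no duplicate points):

from scipy.spatial import cKDTree

# Build the tree on the raw coordinates
coords = np.array([(geom.x, geom.y) for geom in gdf.geometry])
tree = cKDTree(coords)

# Query 4 neighbours per point (the first hit is the point itself)
# and apply the cutoff inside the query
dist, idx = tree.query(coords, k=4, distance_upper_bound=3)
dist, idx = dist[:, 1:], idx[:, 1:]  # drop the self-match

# Neighbours beyond the cutoff come back with idx == len(coords),
# so append a NaN sentinel and average only the valid entries
# (a point with no neighbour within the cutoff at all yields NaN)
vals = np.append(gdf['values'].to_numpy(dtype=float), np.nan)
gdf['mean'] = np.nanmean(vals[idx], axis=1)

For the toy data this reproduces the means in the table above.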
Upvotes: 9