Reputation: 81
I have a set 100k of of geo locations (lat/lon) and a hexogonal grid (4k polygons). My goal is to calculate the total number of points which are located within each polygon.
My current algorithm uses 2 for loops to loop over all geo points and all polygons, which is really slow if I increase the number of polygons... How would you speedup the algorithm? I have uploaded a minimal example which creates 100k random geo points and uses 561 cells in the grid...
I also saw that reading the geo json file (with 4k polygons) takes some time, maybe i should export the polygons into a csv?
hexagon_grid.geojson file: https://gist.github.com/Arnold1/9e41454e6eea910a4f6cd68ff1901db1
minimal python example: https://gist.github.com/Arnold1/ee37a2e4b2dfbfdca9bfae7c7c3a3755
Upvotes: 2
Views: 492
Reputation: 74262
You don't need to explicitly test each hexagon to see whether a given point is located inside it.
Let's assume, for the moment, that all of your points fall somewhere within the bounds of your hexagonal grid. Because your hexagons form a regular lattice, you only really need to know which of the hexagon centers is closest to each point.
This can be computed very efficiently using a scipy.spatial.cKDTree
:
import numpy as np
from scipy.spatial import cKDTree
import json
with open('/tmp/grid.geojson', 'r') as f:
data = json.load(f)
verts = []
centroids = []
for hexagon in data['features']:
# a (7, 2) array of xy coordinates specifying the vertices of the hexagon.
# we ignore the last vertex since it's equal to the first
xy = np.array(hexagon['geometry']['coordinates'][0][:6])
verts.append(xy)
# compute the centroid by taking the average of the vertex coordinates
centroids.append(xy.mean(0))
verts = np.array(verts)
centroids = np.array(centroids)
# construct a k-D tree from the centroid coordinates of the hexagons
tree = cKDTree(centroids)
# generate 10000 normally distributed xy coordinates
sigma = 0.5 * centroids.std(0, keepdims=True)
mu = centroids.mean(0, keepdims=True)
gen = np.random.RandomState(0)
xy = (gen.randn(10000, 2) * sigma) + mu
# query the k-D tree to find which hexagon centroid is nearest to each point
distance, idx = tree.query(xy, 1)
# count the number of points that are closest to each hexagon centroid
counts = np.bincount(idx, minlength=centroids.shape[0])
Plotting the output:
from matplotlib import pyplot as plt
fig, ax = plt.subplots(1, 1, subplot_kw={'aspect': 'equal'})
ax.hold(True)
ax.scatter(xy[:, 0], xy[:, 1], 10, c='b', alpha=0.25, edgecolors='none')
ax.scatter(centroids[:, 0], centroids[:, 1], marker='h', s=(counts + 5),
c=counts, cmap='Reds')
ax.margins(0.01)
I can think of several different ways you could handle points that fall outside your grid depending on how much accuracy you need:
You could exclude points that fall outside the outer bounding rectangle of your hexagon vertices (i.e. x < xmin
, x > xmax
etc.). However, this will fail to exclude points that fall within the 'gaps' along the edges of your grid.
Another straightforward option would be to set a cut-off on distance
according to the spacing of your hexagon centers, which is equivalent to using a circular approximation for your outer hexagons.
If accuracy is crucial then you could define a matplotlib.path.Path
corresponding to the outer vertices of your hexagonal grid, then use its .contains_points()
method to test whether your points are contained within it. Compared to the other two methods, this would probably be slower and more fiddly to code.
Upvotes: 7