William Fabyan

Reputation: 3

Scaling the OSMnx library's 'nearest_edges' function on a huge Spark dataset

I am trying to scale the distance calculation returned by the 'nearest_edges' function (from the OSMnx library) across a huge Spark dataset, using the lat and long columns as the inputs for building my multidigraph. It takes forever to run and sometimes returns null. Is there any other solution? I created a user-defined function (code below) so I can apply it to the dataset using that dataset's long/lat columns.

My code below:

import numpy as np
import osmnx as ox
from shapely.geometry import Point
from pyspark.sql import types as T
from pyspark.sql.functions import udf

@udf(returnType=T.DoubleType())
def get_distance_to_road(lat_dd=None, long_dd=None, dist_bbox=None):
    try:
        location = (lat_dd, long_dd)

        # build a small street network around the point
        G = ox.graph_from_point(
            center_point=location,
            dist=dist_bbox,       # meters
            simplify=True,
            retain_all=True,
            truncate_by_edge=True,
            network_type='all',
        )

        # project the graph, then project the point to the same CRS
        Gp = ox.project_graph(G)
        point_geom_proj, crs = ox.projection.project_geometry(
            Point(long_dd, lat_dd),  # shapely Point expects (x, y) = (lon, lat)
            to_crs=Gp.graph['crs'],
        )

        # distance (meters) from the point to its nearest edge, rounded to 2 decimals
        distance = np.round(
            ox.nearest_edges(Gp, point_geom_proj.x, point_geom_proj.y, return_dist=True)[1], 2
        ).item()

    except Exception:
        distance = None
    return distance  # meters
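
Applying the UDF to the DataFrame then looks roughly like this (the DataFrame name, column names, and the 500 m box distance below are just placeholders):

from pyspark.sql import functions as F

# df is the (placeholder) Spark DataFrame with latitude/longitude columns
df_with_dist = df.withColumn(
    "dist_to_road_m",
    get_distance_to_road(F.col("lat_dd"), F.col("long_dd"), F.lit(500.0)),
)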

Upvotes: 0

Views: 86

Answers (2)

gboeing

Reputation: 6442

The nearest_edges function is fast and scalable; the problem here is everything else you're doing each time you call it. First off, you want to run it vectorized rather than in a loop. That is, if you have many points to snap to their nearest edges, pass them all at once as numpy arrays to the nearest_edges function for a vectorized, spatially indexed look-up:

import osmnx as ox

# get projected graph and randomly sample some points to find nearest edges to
G = ox.graph.graph_from_place("Piedmont, CA, USA", network_type="drive")
Gp = ox.projection.project_graph(G)
points = ox.utils_geo.sample_points(ox.convert.to_undirected(Gp), n=1000000)

%%time
ne, dist = ox.distance.nearest_edges(Gp, X=points.x, Y=points.y, return_dist=True)
# wall time = 8.3 seconds

Here, the nearest_edges search matched 1 million points to their nearest edges in about 8 seconds. If you instead put this all into a loop (which with each iteration builds a graph, projects the graph and point, then finds the nearest edge to that one point), matching these million points will take approximately forever. This isn't because nearest_edges is slow... it's because everything else in the loop is (relatively) slow.

Your basic options are:

  1. Vectorize everything as demonstrated above.
  2. If you must build separate graphs (like, you're modeling completely different cities or countries or something), try to reduce the number of graphs you build by batching your nearby points to search within a single graph (a rough sketch follows this list).
  3. Use multiprocessing to parallelize.
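
A rough sketch of option 2 is below. The grouping of nearby points into batches is assumed to have happened upstream, and the function name and dist value are just placeholders, not something from the benchmark above:

import geopandas as gpd
import numpy as np
import osmnx as ox

def batch_distances(lats, lons, dist=5000):
    # build one graph covering the whole batch, centered on its midpoint
    center = (float(np.mean(lats)), float(np.mean(lons)))
    G = ox.graph.graph_from_point(center, dist=dist, network_type="all", retain_all=True)
    Gp = ox.projection.project_graph(G)

    # project all the batch's points to the graph's CRS in one go
    pts = gpd.GeoSeries(gpd.points_from_xy(lons, lats), crs="EPSG:4326").to_crs(Gp.graph["crs"])

    # one vectorized nearest_edges call for the entire batch
    _, dists = ox.distance.nearest_edges(Gp, X=pts.x, Y=pts.y, return_dist=True)
    return dists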

Upvotes: 0

McToel

Reputation: 351

Your example does not give me code I can run myself, but in general I have noticed that OSMnx is not well suited to large amounts of data. In particular, nearest_edges uses a lot of CPU and RAM to build a spatial index and then query it. That said, nearest_edges does work and is optimized for querying many points at once. I would try the following things:

Start with a smaller subset of data

At first, use only as much data as you absolutely need to test your functionality. Then, once it works, just let it run for as long as it needs.
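
With Spark that can be as simple as the following (the fraction and row count are arbitrary example values):

# work on a small, reproducible sample first
small_df = df.sample(fraction=0.001, seed=42)
# or just take the first N rows
small_df = df.limit(10000)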

Profile your code

Run your code under cProfile or a similar profiler to see which part of OSMnx is actually making it slow, and go from there.
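
For example, you could profile a single call of a plain (un-decorated) copy of your distance function locally, outside Spark; the function name, coordinates, and 500 m box below are just example values:

import cProfile
import pstats

# run one representative call under the profiler and dump the stats to a file
cProfile.run("get_distance_to_road_plain(37.7749, -122.4194, 500)", "nearest_edge.prof")

# print the 20 most expensive calls, sorted by cumulative time
pstats.Stats("nearest_edge.prof").sort_stats("cumulative").print_stats(20)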

Upvotes: 0
