Alessandro Zanoli

Reputation: 11

Scalable Python Shortest Path on Large DAG with Multiple Walkers

I have a large Directed Acyclic Graph (DAG) (~10 million edges, ~100k nodes) and a list of around ~30k walkers. Each walker has an origin and a destination node, which are connected into the DAG after some additional computation to determine the connecting edges (usually ~50 edges per walker). For each walker I then need to compute the shortest path efficiently, ideally in a scalable, parallel manner.

I am fairly fluent only in Python, so I tried a first approach using Dask for parallelization. This worked well enough for computing the additional edges, but for the actual shortest-path computation it is crucial that I use graph-tool's fast C++ implementation of the algorithm, which is optimized for DAGs.

Unfortunately, it is unclear to me how to efficiently share the large "base" DAG among Dask workers, and then map the function that adds the nodes/edges and computes the shortest path over each walker.

Also, I am currently working on a single Linux VM with 8 threads and, I think, enough memory to hold the computations for each thread, but I can use multiple VMs and would like to benefit from the possible speed-up.

Edit:

What I would do as a for loop in Python is:

import graph_tool as gt
import graph_tool.topology  # gt.topology is not loaded by "import graph_tool" alone

# instantiate the graph_tool graph
g = gt.Graph()

# add the weighted base edges
g.ep['weight'] = g.new_edge_property('double')
g.add_edge_list(edges[['src', 'dst', 'weight']].values, eprops=[g.ep['weight']])

for walker in walkers:
    origin_node       = g.add_vertex()
    destination_node  = g.add_vertex()
    origin_edges      = function_to_compute_origin_edges(walker)
    destination_edges = function_to_compute_destination_edges(walker)
    # add_edge_list expects a flat list of (src, dst, weight) tuples,
    # not a list of two lists
    g.add_edge_list(origin_edges + destination_edges, eprops=[g.ep['weight']])

    vertex_list, edge_list = gt.topology.shortest_path(
        g, origin_node, destination_node, weights=g.ep['weight'], dag=True)

    # Graph has no remove_vertices(); remove_vertex() accepts an iterable
    g.remove_vertex([origin_node, destination_node])

Since every iteration is independent of the others, I would like to parallelize this instead of running it as a sequential for loop.
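One way to sketch the pattern being asked about, without settling the graph-tool-specific sharing question: load the shared base graph once per worker process via a pool initializer, so it is never re-shipped with each task, and have each task overlay its walker-specific edges locally. The snippet below is a minimal, runnable illustration using only the standard library; the plain-dict DAG, `init_worker`, and the heapq-based Dijkstra stand in for the graph-tool objects (with graph-tool, one would build `g` and `g.ep['weight']` inside the initializer the same way).

```python
# Sketch: parallelize independent per-walker shortest-path queries by
# building the shared "base" graph ONCE per worker process (in the pool
# initializer) rather than sending it with every task.
import heapq
from concurrent.futures import ProcessPoolExecutor

# module-level slot filled in by each worker's initializer
_BASE_GRAPH = None

def init_worker(edges):
    """Build the base graph once per worker process (stand-in for gt.Graph)."""
    global _BASE_GRAPH
    adj = {}
    for src, dst, w in edges:
        adj.setdefault(src, []).append((dst, w))
    _BASE_GRAPH = adj

def shortest_path_cost(task):
    """Dijkstra on the worker-local base graph plus walker-specific edges."""
    origin, destination, extra_edges = task
    # cheap local overlay so the shared base graph is never mutated
    adj = {k: list(v) for k, v in _BASE_GRAPH.items()}
    for src, dst, w in extra_edges:
        adj.setdefault(src, []).append((dst, w))
    dist = {origin: 0.0}
    pq = [(0.0, origin)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == destination:
            return d
        if d > dist.get(u, float('inf')):
            continue  # stale queue entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return float('inf')

if __name__ == '__main__':
    base_edges = [('a', 'b', 1.0), ('b', 'c', 2.0), ('a', 'c', 5.0)]
    # each walker: (origin, destination, edges tying it into the DAG)
    walkers = [
        ('O1', 'D1', [('O1', 'a', 0.5), ('c', 'D1', 0.5)]),
        ('O2', 'D2', [('O2', 'b', 1.0), ('c', 'D2', 1.0)]),
    ]
    with ProcessPoolExecutor(max_workers=2,
                             initializer=init_worker,
                             initargs=(base_edges,)) as ex:
        costs = list(ex.map(shortest_path_cost, walkers))
    print(costs)  # [4.0, 4.0]
```

The same idea carries over to Dask: keep one copy of the base graph resident per worker and send only the small per-walker edge lists across the wire, rather than serializing the 10M-edge graph with every task.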

Upvotes: 1

Views: 47

Answers (0)
