Reputation: 11
I have a large Directed Acyclic Graph (DAG) (~10 million edges, ~100k nodes) and a list of walkers (around ~30k). Each walker has an origin and a destination node, which are connected into the DAG by some additional computation that determines the connecting edges (usually ~50 edges per walker). For each walker I then need to efficiently compute the shortest path, ideally in a scalable, parallel manner.
I am only really fluent in Python, so my first approach used Dask for parallelization. This worked well enough for computing the additional edges, but for the actual shortest-path computation it is crucial that I use graph-tool's fast C++ implementation of the algorithm, which is optimized for DAGs.
Unfortunately, it is unclear to me how to efficiently share the large "base" DAG among the Dask workers, and then map a function over the walkers that adds each walker's nodes/edges and computes its shortest path.
I am currently working on a single Linux VM with 8 threads and, I think, enough memory to hold one copy of the computation per thread, but I can also use multiple VMs and would like to benefit from the possible speed-up.
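My current idea is to give each worker process its own copy of the base graph, loading it from disk on first use rather than serializing it through the scheduler. A minimal, untested sketch of what I mean (the file name 'base_dag.gt' and the helper get_base_graph are placeholders I made up):
import graph_tool as gt
from dask.distributed import get_worker

def get_base_graph():
    # Cache one copy of the large base DAG per Dask worker process,
    # loading it from disk on first use instead of shipping ~10M edges
    # through the scheduler with every task.
    worker = get_worker()
    if not hasattr(worker, 'base_graph'):
        worker.base_graph = gt.load_graph('base_dag.gt')
    return worker.base_graph
I do not know whether this is the idiomatic way to share a large read-only object among Dask workers.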
Edit:
What I currently do as a plain for loop in Python is:
import graph_tool as gt
import graph_tool.topology  # needed so that gt.topology resolves
# instantiate the graph-tool graph
g = gt.Graph()
# add the base edges with their weights
g.ep['weight'] = g.new_edge_property('double')
g.add_edge_list(edges[['src', 'dst', 'weight']].values, eprops=[g.ep['weight']])
for walker in walkers:
    # attach the walker's origin and destination to the base DAG
    origin_node = g.add_vertex()
    destination_node = g.add_vertex()
    origin_edges = function_to_compute_origin_edges(walker)
    destination_edges = function_to_compute_destination_edges(walker)
    g.add_edge_list(origin_edges + destination_edges, eprops=[g.ep['weight']])
    # shortest path via graph-tool's DAG-optimized C++ implementation
    vertex_list, edge_list = gt.topology.shortest_path(
        g, origin_node, destination_node, weights=g.ep['weight'], dag=True)
    # restore the base DAG before the next walker
    g.remove_vertex([destination_node, origin_node])
Since every computation is independent of the others, I would like to parallelize this instead of using a for loop.
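Concretely, I imagine something like the following, reusing get_base_graph from the sketch above (untested; it assumes single-threaded worker processes, since each task temporarily mutates the worker-local graph, and that the saved graph includes the 'weight' edge property):
import graph_tool as gt
import graph_tool.topology
from dask.distributed import Client

def process_walker(walker):
    # Runs on a worker and temporarily mutates the worker-local copy of the
    # graph, hence the assumption of one task at a time per worker.
    g = get_base_graph()
    origin_node = g.add_vertex()
    destination_node = g.add_vertex()
    origin_edges = function_to_compute_origin_edges(walker)
    destination_edges = function_to_compute_destination_edges(walker)
    g.add_edge_list(origin_edges + destination_edges, eprops=[g.ep['weight']])
    vertex_list, edge_list = gt.topology.shortest_path(
        g, origin_node, destination_node, weights=g.ep['weight'], dag=True)
    # Return plain vertex indices: Vertex objects are tied to the worker's
    # graph and would not survive serialization back to the client.
    path = [int(v) for v in vertex_list]
    g.remove_vertex([destination_node, origin_node])
    return path

client = Client(n_workers=8, threads_per_worker=1, processes=True)
results = client.gather(client.map(process_walker, walkers))
Is this a reasonable pattern, and how would it extend to multiple VMs?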
Upvotes: 1
Views: 47