pedroivo000
pedroivo000

Reputation: 88

Find all paths starting from source node with Perl

First I would like to clarify that I have very little experience with Graph Theory and the proper algorithms to parse a directed graph, and that I've searched here on SO but didn't quite find what I was looking for. Hopefully you guys can help me :)

I have a large directed graph (around 3000 nodes) that has several subgraphs made out of connected nodes, and the subgraphs are not connected to each other. Here is a small representative graph of the data I have here:

directed graph with unconnected subgraphs

I am writing a Perl script to find all possible paths starting from each source node to the sink nodes and store them in an array of arrays. So, for this graph, the possible paths would be:

1,2,3,4,5,6
1,2,3,4,5,7
1,8,9,10
11,12,13
11,14
15,16,17

The way I've done this search in my script was to use the Graph module in the following steps:

  1. Find all source nodes in the graph and store them in an array
  2. Find all sink nodes in the graph and store them in an array
  3. Find all pairs short paths with the Floyd-Warshall algorithm
  4. Search the APSP Floyd-Warshall graph object if exist a path between a source node and a sink node. If there is a path, store it in array of arrays. If there isn't a path, do nothing.

Here is the part of my script that does it:

#Getting all source nodes in the graph:
my @source_nodes = $dot_graph->source_vertices();
my @sink_nodes = $dot_graph->sink_vertices();

# Getting all possible paths between from source to sink nodes in the graph:
print "Calculating all possible overlaps in graph\n";
my $all_possible_paths = $dot_graph->APSP_Floyd_Warshall();
print "Done\n"; 

# print "Extending overlapping contigs\n"; 
my @all_paths;
foreach my $source (@source_nodes) {
    foreach my $sink (@sink_nodes) {
        my @path_vertices = $all_possible_paths->path_vertices($source, $sink);
        my $path_length = $all_possible_paths->path_length($source,$sink);

        #Saving only the paths that actually exist:
        push (@all_paths, \@path_vertices) unless (!@path_vertices);
    }
}

The problem with that is that it works fine for small graphs, but now that I have 3000 nodes, it would take a very very long time to finish (assuming that each path would take 1ms to be found, it would take 312.5 days to go through all of them). I know using the Floyd-Warshall algorithm to find all possible paths in the graph to only find the paths between sources and sinks is not efficient, but when I wrote the script I needed the results as soon as possible and my graphs were a lot smaller.

My question is how can I find the all paths starting from each source in the graph that will end in a sink node, without find all possible paths first? Is that what is called a breadth-first or a depth-first search? How to implement that with Perl (and if possible, with the Graph module)? Any help would be awesome!

P.S.: In order to make the program run faster, I started trying to breaking the initial big graph into its subgraphs and running the original script, but forking the main loop that searches for the paths between sources and sinks using Parallel::ForkManager. What do you guys think of that approach?

Upvotes: 2

Views: 1042

Answers (1)

ikegami
ikegami

Reputation: 386386

You're not interested in finding the shortest path, so forget about all those shortest path algorithms.

You're interested in finding all paths. This is called tree traversal, and it can be performed depth-first or breadth-first. Since you're traversing the entire tree, it doesn't really matter which approach is taken. The following performs a depth-first search using recursion:

sub build_paths {
   my ($graph) = @_;

   my @paths;

   local *_helper = sub {
      my $v = $_[-1];
      my @successors = $graph->successors($v);
      if (@successors) {
         _helper(@_, $_)
            for @successors;
      } else {
         push @paths, [ @_ ];
      }
   };

   _helper($_)
      for $graph->source_vertices();

   return \@paths;
}

die if $graph->has_a_cycle;

my $paths = build_paths($graph);

Yes, it would be possible to parallelize the work, but I'm not writing that for you.

What concerns me the most is memory. Depending on the number of branches in the paths, you could easily end up running out of memory. Note that storing the paths as strings (of comma-separated values) would take less memory than storing them as arrays.

Upvotes: 5

Related Questions