Ingwie Phoenix
Ingwie Phoenix

Reputation: 2983

Topological sorting of C++ data structures

Before I start, I first have to mention that with the term "graph" I'll refer to an image displaying a structure. Due to my visual impairment, I can neither imagine nor draw them. I can look at them and understand - but I have a difficult time making them up myself.

I am working on a build tool that uses a scripting language to generate targets, which are processed and split into tasks. A task is represented as a data structure: https://github.com/IngwiePhoenix/IceTea/blob/master/src/main.cpp#L98

As you can see, the Task class takes note of it is a master (the actual result of the process) and stores a reference - note, one reference only, not an array - to its child and parent - if any.

So far I am able to fill the task queue with all required tasks and send it into the executor void Run(void*): https://github.com/IngwiePhoenix/IceTea/blob/master/src/main.cpp#L1107

But here come the problems:

Currently, the program is structured with a thread pool that pulls tasks from the queue and runs them. But after I have investigated more into the Ninja tool, this logic might change; there will be a thread pool that only executes commands generated from the scripting language. However, this has to wait, for now. Therefore, I am making the tool run on a single thread only, to simulate the behavior that I actually want; one command running after another.

The problem that I am mainly facing though, stays to be the sorting mechanism. From my studies on build tools, they tend to use what's called a DAG and topological sorting. Now, neither could I find a good explanation on how to write a topological sorting algorithm nor did I figure out how it works at all. I know that there are the constants u and v. But I am unable to find a way to implement it.

So far, I have understood that it is a linear graph. Let's assume the following structure:

myApp.exe
    - main.o
    - util.o
    | libfoo.a
        - foo_a.o
        - foo_b.o

This is very straight forward. There is one result, and one dependency. But what if we have two results depending on the same libfoo?

myApp.exe       libbar.so
    - main.o        - barlib.o
    - util.o        
-------------------------
            | libfoo.a
                - foo_a.o
                - foo_b.o

And that is where I am already stuck at.

Can you maybe explain to me how to implement a topological algorithm, and what a DAG actually is? It would be really helpful, because I am very honestly hitting a barrier here that is hard to overcome. Thank you in advance!

A side-note: I want to keep the tool as small as I can, therefore I can not add stuff like Boost.Graph, which I have seen as a search result.

Upvotes: 2

Views: 2301

Answers (1)

Kittsil
Kittsil

Reputation: 2477

Okay, first things first: A DAG is a Directed Acyclic Graph.

  • A graph is a data structure that has nodes which are connected in some way. For instance, a map is a graph, where the intersections are the nodes and the roads are the edges.

  • A directed graph is a graph with some sort of direction on the edges; for instance, many roads are undirected (please stay on the correct side of the road), but some are one-way.

  • An acyclic graph is a graph without any cycles; that means that once you leave a node, you can't get back to it.

You can't sort a cyclic graph; which part of the cycle would come first? But you can sort an acyclic graph, and the result is a 'topological sort.' As you pointed out (maybe accidentally) in your second example, multiple topological sortings may be valid for the same graph.

In order to handle the dependencies in your problem, you will have to use a graph. However, since you know that it will be that it will be acyclic, you can write your own, very concise graph class.

struct node {
  string name;
  vector<node*> out_neighbors;
  vector<node*> in_neighbors;
}

You would store all the nodes in a single array, graph, and the nodes keep a list of the nodes that come after them in out_neighbors (this is called an "adjacency list" format).

Now how do you sort them? Well, you want them sorted so that nodes do not depend on nodes that come later. First, you want to find all the nodes that have no incoming edges (i.e., don't depend on anything).

queue<node*> free;
for(int i=0; i<n; ++i)
  if(graph[i].in_neighbors.size() == 0)
    free.push(&graph[i]);

Next, you are going to use these to find other nodes.

queue<node*> topsort;         //this is going to be the sorted nodes
while(free.size() > 0) {
  node* top = free.front();   //remove the first element from the graph
  topsort.push(top);
  free.pop();

  for(int i=0; i<top->out_neighbors.size(); ++i) {
    node* neighbor = top->out_neighbors[i];
    neighbor->in_neighbors.erase(
      find(neighbor->in_neighbors.begin(), neighbor->in_neighbors.end(), top)
    );
    if(neighbor->in_neighbors.size() == 0)
      free.push(neighbor);
  }
}

At the end of this, topsort will be a list of the node pointers in a sorted topological order. However, all of your incoming edges have been removed from the graph, as well (this can easily be rebuilt using the outgoing edges).


Some suggestions:

First, your Tasks are already like the simple structure I described here, except that they only have one parent and child pointer; why not have a list of parents (incoming) and children (outgoing)? Then you can sort the Task objects themselves.

Second, once it is sorted topologically, you can run things in threads; so long as all of my incoming edges have been compiled, I'm free to go, too.

Upvotes: 1

Related Questions