Pieter
Pieter

Reputation: 4040

Incremental linearizing of git DAG

I'm the author of GitX. One of the features GitX has is the visualization of branches, as can be seen here.

This visualization is currently done by reading commits which are emitted from git in the correct order. For each commit the parents are known, so it's fairly easy to build up the lanes in the correct way.

I'd like to speed up this process by using my own commit pool and linearizing the commits myself. This allows me to reuse existing loaded commits and allows git to emit commits faster because it doesn't have to emit them in the correct order.

However, I'm not sure what algorithm to use to accomplish this. It is important that the building is incremental, as the loading of commits can take a long time (>5 seconds for 100,000 commits, which should all be displayed).

Gitk has gone the same way, and there's a patch here that shows how it is implemented, but my TCL skills are weak and the patch isn't very thoroughly commented and a bit hard to follow.

I'd also like this algorithm to be efficient, as it'll have to handle hundreds of thousands of commits. It also has to be displayed in a table, so it's important that access to specific rows is fast.

I'll describe the input I have so far, the output that I want and a few observations.

Input:

Output:

A few remarks:

Upvotes: 12

Views: 1358

Answers (4)

Macke
Macke

Reputation: 25710

Do you really need to display 100k commits at once? What kind of user can soak up that kind of info?

Have you thought about paging? I.e just compute for ~100 commits or something. If a branch-line goes way back (off-page), you could use something like Github's back-pointing arrow to show that.

Upvotes: 0

C Pirate
C Pirate

Reputation: 454

OK, so I'm having a similarly hard time reading the entirety of that patch, but let's see if I can piece it together from what I did figure out.

To start with, gitk simplifies things by condensing a string of commits into an arc, containing a series of commits that each only have one parent and one child. Aside from anything else, doing this should cut down pretty dramatically on the number of nodes you have to consider for your sort, which will help out any algorithm you use. As a bonus, related commits will end up grouped together.

This does introduce some complexity in terms of finding an arc when you read a new commit. There are a few situations:

  • The new commit has a single parent, or no parents. It extends a (possibly empty) arc. Most of the time, you'll just extend the most recent arc. There are a few interesting subcases:
    • It may cause an existing arc to be split, if its parent already has a child (i.e. its parent turns out to be a branch point, which I gather you don't know ahead of time).
    • It could be a "missing link" that connects two arcs together.
    • You may already know that this commit has multiple children
  • The new commit has multiple parents (a merge commit).

You may want to include the multi-child or multi-parent commits in arcs, or it may make more sense to keep them separate. Either way, it shouldn't be too difficult to build up this set of arcs incrementally.

Once you have these arcs, you're still left with trying to linearize them. In your case, the first algorithm described on the aforementioned Wikipedia page sounds useful, as you have a known set of branch points to use as your initial set S.

Other notes:

  • Relocating commits should be manageable. First of all, you only have to care when you connect two arcs, either through a new merge commit, a newly-discovered branch point, or combining two arcs into one. Any given arc can easily maintain its current row number range (assuming you're fine with putting an arc on sequential rows), so traversing up the tree checking that all new ancestors show up later should be pretty quick.
  • I don't know enough to say much about drawing the graph lines, but I imagine it won't be too different from what you do now.

Anyway, I hope that helps. It was interesting to think about, at least.

Upvotes: 3

obecalp
obecalp

Reputation: 1691

The standard topological sort is O(n) (OK, O(V+E)), i.e. you should be able to sort a million commits in memory in a fraction of a second. No incremental hack like those in Tcl is needed.

BTW, I use GitX (looks much better than Gitk on OS X) everyday and don't have any issue with it (maybe because I don't have those crazy merges in my repositories) :)

Upvotes: 6

Paul
Paul

Reputation: 16315

I haven't used GitX, so maybe I'm missing something, but it seems like you could walk back from child to parent(s) from the head of each current branch until you can draw a few screens of the graph.

That might not give you the optimal visual layout of branches that are rooted earlier. But it seems like responsiveness would be more important than waiting to draw a graph with the fewest crossings, since most users are likely to be interested in recent activity.

Upvotes: -2

Related Questions