liadan
liadan

Reputation: 21

How does git fetches commits associated to a file?

I'm writing a simple parser of .git/* files. I covered almost everything, like objects, refs, pack files etc. But I have a problem. Let's say I have a big 300M repository (in a pack file) and I want to find out all the commits which changed /some/deep/inside/file file. What I'm doing now is:

And I repeat it over and over until I reach very first commit.

This solution works, but it sucks. In worse case scenario, first search can take even 3 minutes (for 300M pack).

Is there any way to speed it up ? I tried to avoid putting so large objects in memory, but right now I don't see any other way. And even that, initial memory load will take forever :(

Greets and thanks for any help!

Upvotes: 2

Views: 89

Answers (1)

araqnid
araqnid

Reputation: 133792

That's the basic algorithm that git uses to track changes to a particular file. That's why "git log -- some/path/to/file.txt" is a comparatively slow operation, compared to many other SCM systems where it would be simple (e.g. in CVS, P4 et al each repo file is a server file with the file's history).

It shouldn't take so long to evaluate though: the amount you ever have to keep in memory is quite small. You already mentioned the main point: remember the tree IDs going down to the path to quickly eliminate commits that didn't even touch that subtree. It's rare for tree objects to be very big, just like directories on a filesystem (unsurprisingly).

Are you using the pack index? If you're not, then you essentially have to unpack the entire pack to find this out since trees could be at the end of a long delta chain. If you have an index, you'll still have to apply deltas to get your tree objects, but at least you should be able to find them quickly. Keep a cache of applied deltas, since obviously it's very common for trees to reuse the same or similar bases- most tree object changes are just changing 20 bytes from a previous tree object. So if in order to get tree T1, you have to start with object T8 and apply Td7 to get T7, T6.... etc. it's entirely likely that these other trees T2-8 will be referenced again.

Upvotes: 1

Related Questions