user1816847

Reputation: 2067

what gets "cloned" and "pushed" during git clone and git push

When I run a command such as

git push

or

git push origin master

and my repo looks like

      B--C--D <- master
     /
    A--E--F <- foo-branch

and origin just looks like

A <- master

Does push include commits E and F? I understand that it typically does not include foo-branch, but do all commits still get pushed?

Likewise, when I do

git clone <some-remote-repo>

I know I typically get one branch (usually master, it seems), but do I also have local copies of commits for other branches, even if I don't get the pointers to their heads?

Upvotes: 2

Views: 280

Answers (3)

VonC

Reputation: 1323593

The previous answer (Git 2.4.7, Q3 2015, up to Git 2.33, Q3 2021) showed how bitmaps (the structure that stores reachability information about the set of objects in a packfile or a multi-pack index (MIDX)) evolved.

This goes on with Git 2.37:

The multi-pack-index code did not protect the packfile it depends on from being removed while in use; this has been corrected with Git 2.37 (Q3 2022).

See commit 4090511, commit 5045759, commit 58a6abb, commit 44f9fd6 (24 May 2022) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit 0916804, 03 Jun 2022)

pack-bitmap.c: check preferred pack validity when opening MIDX bitmap

Signed-off-by: Taylor Blau

When pack-objects adds an entry to its packing list, it marks the packfile and offset containing the object, which we may later use during verbatim reuse (c.f., write_reused_pack_verbatim()).

If the packfile in question is deleted in the background (e.g., due to a concurrent git repack), we'll die() as a result of calling use_pack(), unless we have an open file descriptor on the pack itself.
4c08018 ("pack-objects: protect against disappearing packs", 2011-10-14, Git v1.7.8-rc0 -- merge) worked around this by opening the pack ahead of time before recording it as a valid source for reuse.

4c08018's treatment meant that we could tolerate disappearing packs, since it ensures we always have an open file descriptor on any pack that we mark as a valid source for reuse.
This tightens the race to only happen when we need to close an open pack's file descriptor (c.f., the caller of packfile.c::get_max_fd_limit()) and that pack was deleted, in which case we'll complain that a pack could not be accessed and die().

The pack bitmap code does this, too, since prior to dc1daac ("pack-bitmap: check pack validity when opening bitmap", 2021-07-23, Git v2.33.0-rc0 -- merge) it was vulnerable to the same race.

The MIDX bitmap code does not do this, and is vulnerable to the same race.
Apply the same treatment as dc1daac to the routine responsible for opening the multi-pack bitmap's preferred pack to close this race.

This patch handles the "preferred" pack (c.f., the section "multi-pack-index reverse indexes" in Documentation/technical/pack-format.txt) specially, since pack-objects depends on reusing exact chunks of that pack verbatim in reuse_partial_packfile_from_bitmap().
So if that pack cannot be loaded, the utility of a bitmap is significantly diminished.

Similar to dc1daac, we could technically just add this check in reuse_partial_packfile_from_bitmap(), since it's possible to use a MIDX .bitmap without needing to open any of its packs.
But it's simpler to do the check as early as possible, covering all direct uses of the preferred pack.
Note that doing this check early requires us to call prepare_midx_pack() early, too, so move the relevant part of that loop from load_reverse_index() into open_midx_bitmap_1().


Note: see also, with Git 2.38 (Q3 2022), the new push.useBitmaps setting: git -c push.useBitmaps=false push disables the use of bitmaps for git push.


Git 2.39 (Q4 2022) makes sure to free structures related to delta islands after use.

See commit 7025f54 (17 Nov 2022) by Eric Wong (ele828).
(Merged by Junio C Hamano -- gitster -- in commit a655f28, 23 Nov 2022)

delta-islands: free island-related data after use

Signed-off-by: Eric Wong
Co-authored-by: Ævar Arnfjörð Bjarmason
Signed-off-by: Taylor Blau

On my use case involving 771 islands of Linux on kernel.org, this reduces memory usage by around 25MB.
The bulk of that comes from free_remote_islands, since free_config_regexes only saves around 40k.

This memory is saved early in the memory-intensive pack process, making it available for the remainder of the long process.


With Git 2.42 (Q3 2023), the object traversal using reachability bitmaps done by "pack-objects" has been tweaked to take advantage of the fact that using "boundary" commits as representative of all the uninteresting ones can save quite a lot of object enumeration.

It reduces the enumeration needed to determine what gets "cloned" and "pushed" during git clone and git push.

See commit b0afdce, commit 47ff853, commit fe90355 (08 May 2023) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit f2ffc74, 22 Jun 2023)

pack-bitmap.c: use commit boundary during bitmap traversal

Helped-by: Jeff King
Helped-by: Derrick Stolee
Signed-off-by: Taylor Blau

When reachability bitmap coverage exists in a repository, Git will use a different (and hopefully faster) traversal to compute revision walks.

Consider a set of positive and negative tips (which we'll refer to with their standard bitmap parlance by "wants", and "haves").
In order to figure out what objects exist between the tips, the existing traversal in prepare_bitmap_walk() does something like:

  1. Consider if we can even compute the set of objects with bitmaps, and fall back to the usual traversal if we cannot.
    For example, pathspec limiting traversals can't be computed using bitmaps (since they don't know which objects are at which paths).
    The same is true of certain kinds of non-trivial object filters.
  2. If we can compute the traversal with bitmaps, partition the (dereferenced) tips into two object lists, "haves", and "wants", based on whether or not the objects have the UNINTERESTING flag, respectively.
  3. Fall back to the ordinary object traversal if either (a) there are more than zero haves, none of which are in the bitmapped pack or MIDX, or (b) there are no wants.
  4. Construct a reachability bitmap for the "haves" side by walking from the revision tips down to any existing bitmaps, OR-ing in any bitmaps as they are found.
  5. Then do the same for the "wants" side, stopping at any objects that appear in the "haves" bitmap.
  6. Filter the results if any object filter (that can be easily computed with bitmaps alone) was given, and then return back to the caller.
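To make steps 4 and 5 concrete, here is a minimal Python sketch, not Git's actual code. It assumes a toy history A--B--C--D where each commit has one tree object, Python integers stand in for bitmaps, and only commit B has a stored bitmap. The names `STORED`, `fill_in` and `bitmap_walk` are invented for the illustration.

```python
# One bit per object, in a fixed (hypothetical) pack order.
OBJECTS = ["A", "tree-A", "B", "tree-B", "C", "tree-C", "D", "tree-D"]
BIT = {name: 1 << i for i, name in enumerate(OBJECTS)}

PARENTS = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"]}
TREE = {commit: "tree-" + commit for commit in PARENTS}

# Assumption: only commit B has an on-disk bitmap, covering all it reaches.
STORED = {"B": BIT["A"] | BIT["tree-A"] | BIT["B"] | BIT["tree-B"]}


def fill_in(tips, stop_at=0):
    """OR together a reachability bitmap for `tips`.

    The walk stops at commits that have a stored bitmap, and skips commits
    whose bit is already set in the result or in `stop_at`.
    """
    result, stack = 0, list(tips)
    while stack:
        commit = stack.pop()
        if (result | stop_at) & BIT[commit]:
            continue
        if commit in STORED:
            result |= STORED[commit]
            continue
        result |= BIT[commit] | BIT[TREE[commit]]
        stack.extend(PARENTS[commit])
    return result


def bitmap_walk(wants, haves):
    have_bits = fill_in(haves)             # step 4: complete "haves" bitmap
    want_bits = fill_in(wants, have_bits)  # step 5: stop at the "haves" side
    return want_bits & ~have_bits          # objects wanted but not yet had


def names(bits):
    return [name for name in OBJECTS if bits & BIT[name]]


print(names(bitmap_walk(wants=["D"], haves=["A"])))
# ['B', 'tree-B', 'C', 'tree-C', 'D', 'tree-D']
```

The final AND NOT is what makes the answer exact: B's stored bitmap drags A and tree-A into the "wants" side, and the subtraction removes them again.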

When there is good bitmap coverage relative to the traversal tips, this walk is often significantly faster than an ordinary object traversal because it can visit far fewer objects.

But in certain cases, it can be significantly slower than the usual object traversal.
Why? Because we need to compute complete bitmaps on either side of the walk.
If either one (or both) of the sides require walking many (or all!) objects before they get to an existing bitmap, the extra bitmap machinery is mostly or all overhead.

One of the benefits, however, is that even if the walk is slower, bitmap traversals are guaranteed to provide an exact answer.
Unlike the traditional object traversal algorithm, which can over-count the results by not opening trees for older commits, the bitmap walk builds an exact reachability bitmap for either side, meaning the results are never over-counted.

But producing non-exact results is OK for our traversal here (both in the bitmap case and not), as long as the results are over-counted, not under.

Relaxing the bitmap traversal to allow it to produce over-counted results gives us the opportunity to make some significant improvements.
Instead of the above, the new algorithm only has to walk from the boundary down to the nearest bitmap, instead of from each of the UNINTERESTING tips.

The boundary-based approach still has degenerate cases, but we'll show in a moment that it is often a significant improvement.

The new algorithm works as follows:

  1. Build a (partial) bitmap of the haves side by first OR-ing any bitmap(s) that already exist for UNINTERESTING commits between the haves and the boundary.
  2. For each commit along the boundary, add it as a fill-in traversal tip (where the traversal terminates once an existing bitmap is found), and perform fill-in traversal.
  3. Build up a complete bitmap of the wants side as usual, stopping any time we intersect the (partial) haves side.
  4. Return the results.
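A rough Python sketch of the difference between the two walks, under simplifying assumptions: sets of commit names stand in for bitmaps, trees are not modeled (each commit walked without a stored bitmap is merely counted as "opened"), and the boundary is computed with a full reachability pass rather than Git's real commit walk. The graph and all function names are hypothetical.

```python
PARENTS = {
    "R": [], "M1": ["R"], "M2": ["M1"],
    "W1": ["M2"], "W2": ["W1"],                               # wants side
    "H1": ["M2"], "H2": ["H1"], "H3": ["H2"], "H4": ["H3"],   # haves side
}
# Assumption: only M2 has an on-disk bitmap, covering R, M1 and M2.
STORED = {"M2": {"R", "M1", "M2"}}


def reachable(tips):
    seen, stack = set(), list(tips)
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(PARENTS[commit])
    return seen


def fill_in(tips, opened, stop=frozenset()):
    """Build a reachability set from `tips`, stopping at stored bitmaps.

    Each commit walked without a stored bitmap is appended to `opened`,
    standing in for the expensive work of opening its trees.
    """
    result, stack = set(), list(tips)
    while stack:
        commit = stack.pop()
        if commit in result or commit in stop:
            continue
        if commit in STORED:
            result |= STORED[commit]
            continue
        opened.append(commit)
        result.add(commit)
        stack.extend(PARENTS[commit])
    return result


def boundary(wants, haves):
    """Uninteresting parents of interesting commits (cf. rev-list --boundary)."""
    uninteresting = reachable(haves)
    interesting = reachable(wants) - uninteresting
    return {p for c in interesting for p in PARENTS[c] if p in uninteresting}


def old_walk(wants, haves):
    opened = []
    have = fill_in(haves, opened)              # walk from every have tip
    want = fill_in(wants, opened, stop=have)
    return want - have, opened


def new_walk(wants, haves):
    opened = []
    have = set()
    for tip in haves:                          # step 1: reuse existing bitmaps
        have |= STORED.get(tip, set())
    have |= fill_in(boundary(wants, haves), opened)   # step 2: fill in from boundary
    want = fill_in(wants, opened, stop=have)   # step 3: complete wants side
    return want - have, opened                 # step 4


old_result, old_opened = old_walk(["W2"], ["H4"])
new_result, new_opened = new_walk(["W2"], ["H4"])
print(sorted(old_result), len(old_opened))  # ['W1', 'W2'] 6
print(sorted(new_result), len(new_opened))  # ['W1', 'W2'] 2
```

In this toy both walks return the same commits, but the old walk has to fill in H4 through H1 before it reaches M2's bitmap, while the boundary-based walk starts at M2 directly.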

And is more-or-less equivalent to using the old algorithm with this invocation:

$ git rev-list --objects --use-bitmap-index $WANTS --not \
    $(git rev-list --objects --boundary $WANTS --not $HAVES |
      perl -lne 'print $1 if /^-(.*)/')

The new result performs significantly better in many cases, particularly when the distance from the boundary commit(s) to an existing bitmap is shorter than the distance from (all of) the have tips to the nearest bitmapped commit.

Note that when using the old bitmap traversal algorithm, the results can be slower than without bitmaps! Under the new algorithm, the result is computed faster with bitmaps than without (at the cost of over-counting the true number of objects in a similar fashion as the non-bitmap traversal):

# (Computing the number of tagged objects not on any branches
# without bitmaps).
$ time git rev-list --count --objects --tags --not --branches
20

real  0m1.388s
user  0m1.092s
sys   0m0.296s

# (Computing the same query using the old bitmap traversal).
$ time git rev-list --count --objects --tags --not --branches --use-bitmap-index
19

real  0m22.709s
user  0m21.628s
sys   0m1.076s

# (this commit)
$ time git.compile rev-list --count --objects --tags --not --branches --use-bitmap-index
19

real  0m1.518s
user  0m1.234s
sys   0m0.284s

The new algorithm is still slower than not using bitmaps at all, but it is nearly a 15-fold improvement over the existing traversal.

A few other future directions for improving bitmap traversal times beyond not using bitmaps at all:

  • Decrease the cost to decompress and OR together many bitmaps together (particularly when enumerating the uninteresting side of the walk).
    Here we could explore more efficient bitmap storage techniques, like Roaring+Run and/or use SIMD instructions to speed up ORing them together.
  • Store pseudo-merge bitmaps, which could allow us to OR together fewer "summary" bitmaps (which would also help with the above).

git config now includes in its man page, in the feature.experimental list:

  • pack.useBitmapBoundaryTraversal=true may improve bitmap traversal times by walking fewer objects.

git config now includes in its man page:

pack.useBitmapBoundaryTraversal

When true, Git will use an experimental algorithm for computing reachability queries with bitmaps. Instead of building up complete bitmaps for all of the negated tips and then OR-ing them together, consider negated tips with existing bitmaps as additive (i.e. OR-ing them into the result if they exist, ignoring them otherwise), and build up a bitmap at the boundary instead.

When using this algorithm, Git may include too many objects as a result of not opening up trees belonging to certain UNINTERESTING commits. This inexactness matches the non-bitmap traversal algorithm.

In many cases, this can provide a speed-up over the exact algorithm, particularly when there is poor bitmap coverage of the negated side of the query.

Upvotes: 0

VonC

Reputation: 1323593

The way this works internally is that git push runs git pack-objects (with --thin), which then runs git rev-list, passing it the commit IDs you've asked to push.

This object-finding can be optimized via bitmaps.

Well, not in every case: since Git 2.4.7 (Q3 2015), bitmaps are no longer used when a pathspec limits the traversal.

See commit c8a70d3 (01 Jul 2015) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit ace6325, 10 Jul 2015)

rev-list: disable --use-bitmap-index when pruning commits

Signed-off-by: Jeff King

The reachability bitmaps do not have enough information to tell us which commits might have changed path "foo", so the current code produces wrong answers for:

git rev-list --use-bitmap-index --count HEAD -- foo

(it silently ignores the "foo" limiter). Instead, we should fall back to doing a normal traversal (it is OK to fall back rather than complain, because --use-bitmap-index is a pure optimization, and might not kick in for other reasons, such as there being no bitmaps in the repository).

This has been noted in Git 2.26 (Q1 2020): The object reachability bitmap machinery and the partial cloning machinery were not prepared to work well together, because some object-filtering criteria that partial clones use inherently rely on object traversal, but the bitmap machinery is an optimization to bypass that object traversal.

There however are some cases where they can work together, and they were taught about them.

See commit 20a5fd8 (18 Feb 2020) by Junio C Hamano (gitster).
See commit 3ab3185, commit 84243da, commit 4f3bd56, commit cc4aa28, commit 2aaeb9a, commit 6663ae0, commit 4eb707e, commit ea047a8, commit 608d9c9, commit 55cb10f, commit 792f811, commit d90fe06 (14 Feb 2020), and commit e03f928, commit acac50d, commit 551cf8b (13 Feb 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 0df82d9, 02 Mar 2020)

pack-bitmap: refuse to do a bitmap traversal with pathspecs

Signed-off-by: Jeff King

rev-list has refused to use bitmaps with pathspec limiting since c8a70d3509 ("rev-list: disable --use-bitmap-index when pruning commits", 2015-07-01, Git v2.5.0-rc2 -- merge).
But this is true not just for rev-list, but for anyone who calls prepare_bitmap_walk(); the code isn't equipped to handle this case.

We never noticed because the only other callers would never pass a pathspec limiter.

But let's push the check down into prepare_bitmap_walk() anyway. That's a more logical place for it to live, as callers shouldn't need to know the details (and must be prepared to fall back to a regular traversal anyway, since there might not be bitmaps in the repository).

It would also prepare us for a day where this case _is_ handled, but that's pretty unlikely. E.g., we could use bitmaps to generate the set of commits, and then diff each commit to see if it matches the pathspec.
That would be slightly faster than a naive traversal that actually walks the commits.
But you'd probably do better still to make use of the newer commit-graph feature to make walking the commits very cheap.


With Git 2.27 (Q2 2020), the object walk with object filter "--filter=tree:0" can now take advantage of the pack bitmap when available.

See commit 9639474, commit 5bf7f1e (04 May 2020) by Jeff King (peff).
See commit b0a8d48, commit 856e12c (04 May 2020) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit 69ae8ff, 13 May 2020)

pack-bitmap.c: make object filtering functions generic

Signed-off-by: Taylor Blau

In 4f3bd5606a ("pack-bitmap: implement BLOB_NONE filtering", 2020-02-14, Git v2.26.0-rc0 -- merge listed in batch #8), filtering support for bitmaps was added for the 'LOFC_BLOB_NONE' filter.

In the future, we would like to add support for filters that behave as if they exclude a certain type of object, for e.g., the tree depth filter with depth 0.

To prepare for this, make some of the functions used for filtering more generic, such as 'find_tip_blobs' and 'filter_bitmap_blob_none' so that they can work over arbitrary object types.

To that end, create 'find_tip_objects' and 'filter_bitmap_exclude_type', and redefine the aforementioned functions in terms of those.


With Git 2.32 (Q2 2021), a corner case of "rev-list --use-bitmap-index --objects" that uses negative tags as the stopping points has been optimized.

This also helps determine what gets "cloned" and "pushed" during git clone and git push, this time paying attention to tags:

See commit 540cdc1 (22 Mar 2021) by Patrick Steinhardt (pks-t).
(Merged by Junio C Hamano -- gitster -- in commit 58840e6, 07 Apr 2021)

pack-bitmap: avoid traversal of objects referenced by uninteresting tag

Signed-off-by: Patrick Steinhardt

When preparing the bitmap walk, we first establish the set of have and want objects by iterating over the set of pending objects: if an object is marked as uninteresting, it's declared as an object we already have, otherwise as an object we want.
These two sets are then used to compute which transitively referenced objects we need to obtain.

One special case here are tag objects: when a tag is requested, we resolve it to its first not-tag object and add both resolved objects as well as the tag itself into either the have or want set.
Given that the uninteresting-property always propagates to referenced objects, it is clear that if the tag is uninteresting, so are its children and vice versa.
But we fail to propagate the flag, which effectively means that referenced objects will always be interesting except for the case where they have already been marked as uninteresting explicitly.

This mislabeling does not impact correctness: we now have it in our "wants" set, and given that we later do an AND NOT of the bitmaps of "wants" and "haves" sets it is clear that the result must be the same.
But we now start to needlessly traverse the tag's referenced objects in case it is uninteresting, even though we know that each referenced object will be uninteresting anyway.
In the worst case, this can lead to a complete graph walk just to establish that we do not care for any object.

Fix the issue by propagating the UNINTERESTING flag to pointees of tag objects and add a benchmark with negative revisions to p5310.
This shows some nice performance benefits, tested with linux.git:

Test                                                          HEAD~                  HEAD
---------------------------------------------------------------------------------------------------------------
5310.3: repack to disk                                        193.18(181.46+16.42)   194.61(183.41+15.83) +0.7%
5310.4: simulated clone                                       25.93(24.88+1.05)      25.81(24.73+1.08) -0.5%
5310.5: simulated fetch                                       2.64(5.30+0.69)        2.59(5.16+0.65) -1.9%
5310.6: pack to file (bitmap)                                 58.75(57.56+6.30)      58.29(57.61+5.73) -0.8%
5310.7: rev-list (commits)                                    1.45(1.18+0.26)        1.46(1.22+0.24) +0.7%
5310.8: rev-list (objects)                                    15.35(14.22+1.13)      15.30(14.23+1.07) -0.3%
5310.9: rev-list with tag negated via --not --all (objects)   22.49(20.93+1.56)      0.11(0.09+0.01) -99.5%
5310.10: rev-list with negative tag (objects)                 0.61(0.44+0.16)        0.51(0.35+0.16) -16.4%
5310.11: rev-list count with blob:none                        12.15(11.19+0.96)      12.18(11.19+0.99) +0.2%
5310.12: rev-list count with blob:limit=1k                    17.77(15.71+2.06)      17.75(15.63+2.12) -0.1%
5310.13: rev-list count with tree:0                           1.69(1.31+0.38)        1.68(1.28+0.39) -0.6%
5310.14: simulated partial clone                              20.14(19.15+0.98)      19.98(18.93+1.05) -0.8%
5310.16: clone (partial bitmap)                               12.78(13.89+1.07)      12.72(13.99+1.01) -0.5%
5310.17: pack to file (partial bitmap)                        42.07(45.44+2.72)      41.44(44.66+2.80) -1.5%
5310.18: rev-list with tree filter (partial bitmap)           0.44(0.29+0.15)        0.46(0.32+0.14) +4.5%

While most benchmarks are probably in the range of noise, the newly added 5310.9 and 5310.10 benchmarks consistently perform better.


With Git 2.32 (Q2 2021), a configuration variable has been added to force tips of certain refs to be given a reachability bitmap.

See commit 3f267a1, commit 483fa7f, commit dff5e49 (31 Mar 2021) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit 0623669, 13 Apr 2021)

t/helper/test-bitmap.c: initial commit

Signed-off-by: Taylor Blau

Add a new 'bitmap' test-tool which can be used to list the commits that have received bitmaps.

In theory, a determined tester could run 'git rev-list --test-bitmap <commit>' to check if '<commit>' received a bitmap or not, since '--test-bitmap' exits with a non-zero code when it can't find the requested commit.

But this is a dubious behavior to rely on, since arguably 'git rev-list' could continue its object walk outside of which commits are covered by bitmaps.

This will be used to test the behavior of 'pack.preferBitmapTips'.

And:

builtin/pack-objects.c: respect 'pack.preferBitmapTips'

Suggested-by: Jeff King
Signed-off-by: Taylor Blau

When writing a new pack with a bitmap, it is sometimes convenient to indicate some reference prefixes which should receive priority when selecting which commits to receive bitmaps.

A truly motivated caller could accomplish this by setting 'pack.islandCore', (since all commits in the core island are similarly marked as preferred) but this requires callers to opt into using delta islands, which they may or may not want to do.

Introduce a new multi-valued configuration, 'pack.preferBitmapTips' to allow callers to specify a list of reference prefixes.
All references which have a prefix contained in 'pack.preferBitmapTips' will mark their tips as "preferred" in the same way as commits are marked as preferred for selection by 'pack.islandCore'.

The choice of the verb "prefer" is intentional: marking the NEEDS_BITMAP flag on an object does not guarantee that that object will receive a bitmap.
It merely guarantees that that commit will receive a bitmap over any other commit in the same window by bitmap_writer_select_commits().

The test this patch adds reflects this quirk, too.
It only tests that a commit (which didn't receive bitmaps by default) is selected for bitmaps after changing the value of 'pack.preferBitmapTips' to include it.
Other commits may lose their bitmaps as a byproduct of how the selection process works (bitmap_writer_select_commits() ignores the remainder of a window after seeing a commit with the NEEDS_BITMAP flag).

This configuration will aid in selecting important references for multi-pack bitmaps, since they do not respect the same pack.islandCore configuration.
(They could, but doing so may be confusing, since it is packs--not bitmaps--which are influenced by the delta-islands configuration).

In a fork network repository (one which lists all forks of a given repository as remotes), for example, it is useful to set pack.preferBitmapTips to 'refs/remotes/<root>/heads' and 'refs/remotes/<root>/tags', where '<root>' is an opaque identifier referring to the repository which is at the base of the fork chain.

git config now includes in its man page:

pack.preferBitmapTips

When selecting which commits will receive bitmaps, prefer a commit at the tip of any reference that is a suffix of any value of this configuration over any other commits in the "selection window".

Note that setting this configuration to refs/foo does not mean that the commits at the tips of refs/foo/bar and refs/foo/baz will necessarily be selected. This is because commits are selected for bitmaps from within a series of windows of variable length.

If a commit at the tip of any reference which is a suffix of any value of this configuration is seen in a window, it is immediately given preference over any other commit in that window.
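A toy Python model of this "preference within a window" behavior. It is only a sketch: real selection windows have variable length, and the fallback used here (picking the last commit of a window when nothing is preferred) is an assumption made for the illustration. The ref names, commit names and both functions are hypothetical; the matching follows the commit message above, which speaks of reference prefixes.

```python
def is_preferred(refname, prefixes):
    """True when the ref name starts with one of the configured prefixes."""
    return any(refname.startswith(prefix) for prefix in prefixes)


def select_commits(commits, window_size, preferred):
    """Pick one commit per window.

    A preferred commit wins over the others in its window; the rest of that
    window is then ignored. Assumption for this toy: with no preferred
    commit, the last commit of the window is chosen.
    """
    chosen = []
    for start in range(0, len(commits), window_size):
        window = commits[start:start + window_size]
        pick = next((c for c in window if c in preferred), window[-1])
        chosen.append(pick)
    return chosen


commits = ["c1", "c2", "c3", "c4", "c5", "c6"]
refs = {
    "refs/heads/main": "c6",
    "refs/remotes/root/heads/main": "c2",
    "refs/remotes/root/heads/dev": "c3",
}
prefixes = ["refs/remotes/root/heads"]  # stand-in for pack.preferBitmapTips
preferred = {tip for ref, tip in refs.items() if is_preferred(ref, prefixes)}

print(select_commits(commits, 3, set()))      # ['c3', 'c6']
print(select_commits(commits, 3, preferred))  # ['c2', 'c6']
```

Note how c3 sits at the tip of a preferred ref and is still not selected once c2 has taken the first window, which mirrors the caveat that a matching tip is not guaranteed a bitmap.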


With Git 2.33 (Q3 2021), avoid duplicated work while building reachability bitmaps.

See commit aa9ad6f (14 Jun 2021) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 1ef488e, 08 Jul 2021)

bitmaps: don't recurse into trees already in the bitmap

Signed-off-by: Jeff King

If an object is already mentioned in a reachability bitmap we are building, then by definition so are all of the objects it can reach.
We have an optimization to stop traversing commits when we see they are already in the bitmap, but we don't do the same for trees.

It's generally unavoidable to recurse into trees for commits not yet covered by bitmaps (since most commits generally do have unique top-level trees).
But they usually have subtrees that are shared with other commits (i.e., all of the subtrees the commit _didn't_ touch).
And some of those commits (and their trees) may be covered by the bitmap.

Usually this isn't too big a deal, because we'll visit those subtrees only once in total for the whole walk.
But if you have a large number of unbitmapped commits, and if your tree is big, then you may end up opening a lot of sub-trees for no good reason.

We can use the same optimization we do for commits here: when we are about to open a tree, see if it's in the bitmap (either the one we are building, or the "seen" bitmap which covers the UNINTERESTING side of the bitmap when doing a set-difference).

This works especially well because we'll visit all commits before hitting any trees.
So even in a history like:

A -- B

if "A" has a bitmap on disk but "B" doesn't, we'll already have OR-ed in the results from A before looking at B's tree (so we really will only look at trees touched by B).
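The A -- B case can be sketched in Python as follows. This is an illustration under stated assumptions, not the pack-bitmap code: sets stand in for bitmaps, blobs are left out, and the tree names and the `skip_known` switch are invented for the example.

```python
# Toy object graph: commit -> top-level tree, tree -> child trees.
COMMIT_TREE = {"A": "tA", "B": "tB"}
CHILDREN = {
    "tA": ["docs", "src"],
    "tB": ["docs", "src2"],     # B only changed src; docs is shared with A
    "docs": ["docs/img"],
    "docs/img": [], "src": [], "src2": [],
}
# Assumption: A has an on-disk bitmap covering everything it reaches.
A_BITMAP = {"A", "tA", "docs", "docs/img", "src"}


def walk_trees(tree, bitmap, opened, skip_known):
    """Recurse into `tree`, optionally skipping trees already in the bitmap."""
    if skip_known and tree in bitmap:
        return
    bitmap.add(tree)
    opened.append(tree)
    for child in CHILDREN[tree]:
        walk_trees(child, bitmap, opened, skip_known)


def bitmap_for_b(skip_known):
    # Commits are visited before trees, so A's bitmap is already OR-ed in
    # by the time B's top-level tree is opened.
    bitmap = set(A_BITMAP)
    bitmap.add("B")
    opened = []
    walk_trees(COMMIT_TREE["B"], bitmap, opened, skip_known)
    return bitmap, opened


before, opened_before = bitmap_for_b(skip_known=False)
after, opened_after = bitmap_for_b(skip_known=True)
print(opened_before)  # ['tB', 'docs', 'docs/img', 'src2']
print(opened_after)   # ['tB', 'src2']
```

Both runs end with the same bitmap; the check only avoids re-opening the subtrees that A's bitmap already accounts for.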

For most repositories, the timings produced by p5310 are unspectacular.

Any improvement there is within the noise (the +3.1% on test 7 has to be noise, since we are not recursing into trees, and thus the new code isn't even run).
The results for git.git are likewise uninteresting.

But here are numbers from some other real-world repositories (that are not public).
This one's tree is comparable in size to linux.git, but has ~16k refs (and so less complete bitmap coverage):

Test                         HEAD^               HEAD
-------------------------------------------------------------------------
5310.4: simulated clone      38.34(39.86+0.74)   33.95(35.53+0.76) -11.5%
5310.5: simulated fetch      2.29(6.31+0.35)     2.20(5.97+0.41) -3.9%
5310.7: rev-list (commits)   0.99(0.86+0.13)     0.96(0.85+0.11) -3.0%
5310.8: rev-list (objects)   11.32(11.04+0.27)   6.59(6.37+0.21) -41.8%

And here's another with a very large tree (~340k entries), and a fairly large number of refs (~10k):

Test                         HEAD^               HEAD
-------------------------------------------------------------------------
5310.3: simulated clone      53.83(54.71+1.54)   39.77(40.76+1.50) -26.1%
5310.4: simulated fetch      19.91(20.11+0.56)   19.79(19.98+0.67) -0.6%
5310.6: rev-list (commits)   0.54(0.44+0.11)     0.51(0.43+0.07) -5.6%
5310.7: rev-list (objects)   24.32(23.59+0.73)   9.85(9.49+0.36) -59.5%

This patch provides substantial improvements in these larger cases, and should not have any drawbacks for smaller ones (the cost of the bitmap check is quite small compared to an actual tree traversal).


Still with Git 2.33 (Q3 2021): a race between repacking and using pack bitmaps has been corrected.

See commit dc1daac (23 Jul 2021) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 9bcdaab, 02 Aug 2021)

pack-bitmap: check pack validity when opening bitmap

Signed-off-by: Jeff King

When pack-objects adds an entry to its list of objects to pack, it may mark the packfile and offset that contains the file, which we can later use to output the object verbatim.
If the packfile is deleted while we are running (e.g., by another process running "git repack"), we may die in use_pack() if the pack file cannot be opened.

We worked around this in 4c08018 ("pack-objects: protect against disappearing packs", 2011-10-14, Git v1.7.8-rc0 -- merge) by making sure we can open the pack before recording it as a source.
This detects a pack which has already disappeared while generating the packing list, and because we keep the pack's file descriptor (or an mmap window) open, it means we can access it later (unless you exceed core.packedgitlimit).
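The reason an already-open descriptor helps can be seen with a small Python experiment. It assumes POSIX file semantics (an unlinked file stays readable through descriptors opened earlier; Windows behaves differently), and the temporary file is only a made-up stand-in for a packfile.

```python
import os
import tempfile

# Stand-in for a packfile; name and contents are made up for illustration.
fd, path = tempfile.mkstemp(suffix=".pack")
os.write(fd, b"PACK-data")
os.close(fd)

pack = open(path, "rb")   # open the pack before recording it as a source
os.unlink(path)           # a concurrent "repack" deletes the file

still_listed = os.path.exists(path)
data = pack.read()        # the descriptor opened earlier still works
pack.close()

try:
    open(path, "rb").close()   # opening it only now is too late
    reopen_failed = False
except FileNotFoundError:
    reopen_failed = True

print(still_listed, data, reopen_failed)  # False b'PACK-data' True
```

Opening early therefore turns "the pack vanished mid-run" into a non-event, as long as the descriptor is not closed again to stay under a resource limit.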

The bitmap code that was added later does not do this; it adds entries to the packlist without checking that the packfile is still valid, and is vulnerable to this race.
It needs the same treatment as 4c08018.

However, rather than add it in just that one spot, it makes more sense to simply open and check the packfile when we open the bitmap.
Technically you can use the .bitmap without even looking in the .pack file (e.g., if you are just printing a list of objects without accessing them), but it's much simpler to do it early.
That covers all later direct uses of the pack (due to the cached descriptor) without having to check each one directly.
For example, in pack-objects we need to protect the packlist entries, but we also access the pack directly as part of the reuse_partial_pack_from_bitmap() feature.
This patch covers both cases.

Upvotes: 1

torek

Reputation: 487883

It's partly transport-dependent: git has "dumb transports" (such as using http to transfer one object at a time) and "smart transports" (using the git:// or ssh:// protocols, where two gits negotiate with each other, then—provided that the receiver indicates that it's OK—the sender builds a "thin pack").

It's also partly command-dependent: for instance, if you ask for a "shallow" clone, or a single branch, you generally get less than if you do a "normal" clone. And, when you run git push, you can choose which particular commit IDs, if any, you deliver originally to the remote repository, and what branch-name(s) you'd like them to use.

Let's ignore the shallow and single-branch clones for now, though.

Given your example of:

  B--C--D  <- master
 /
A--E--F    <- foo-branch

and git push origin master (whose refspec is presumably equivalent to master:master, i.e., you have not configured an unusual push), where your remote origin currently has commit A (it doesn't matter what branch label(s) it has for A, only that it has A) and assuming a smart protocol, the handshake and transfer protocol starts out pretty much like this:

(your git) "what options do you support? I have thin-packs etc"
(their git) "I have thin-packs and ofs-delta and so on"
(your git) "ok, send me all your refs and their SHA-1s"
(their git) "refs/heads/master is <SHA-1 of A>"
(their git) "that's all I have"

At this point, your git knows what commits are required to get all the commits to the remote: these are the commits that would be listed if you ran, in your repository, git rev-list master ^A (fill in the actual SHA-1 of A, of course). There is no need to exclude additional SHA-1s as the remote origin has nothing but the one branch, whose tip is commit A.
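As a rough illustration of what git rev-list master ^A selects, here is a small Python sketch over the commit graph from the question. It models commits only (no trees or blobs), and the `PARENTS` table and function names are invented for the example.

```python
# Toy commit graph from the question: each commit maps to its parents.
PARENTS = {
    "A": [],
    "B": ["A"], "C": ["B"], "D": ["C"],   # master
    "E": ["A"], "F": ["E"],               # foo-branch
}


def reachable(tips):
    """Return every commit reachable from the given tips (tips included)."""
    seen, stack = set(), list(tips)
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(PARENTS[commit])
    return seen


def rev_list(wants, haves):
    """Commits reachable from `wants` but not from `haves` (want ^have)."""
    return reachable(wants) - reachable(haves)


to_send = rev_list(wants=["D"], haves=["A"])
print(sorted(to_send))  # ['B', 'C', 'D']
```

E and F never show up because nothing reachable from master (D) leads to them; they would only be sent if foo-branch were pushed as well.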

The way this works internally is that git push runs git pack-objects (with --thin), which then runs git rev-list, passing it the commit IDs you've asked to push, with exclusions (--not or prefix ^) for all the commit IDs their git sent you (again in our case that's just the one commit-ID A). See the documentation for git rev-list, paying particular attention to the --objects-edge option (or --objects-edge-aggressive when working with shallow clones).

Your git rev-list therefore outputs the ID of commit D, plus the IDs of its tree and all of that tree's subtrees and blobs, unless it concludes (via the negated IDs, in this case the ^A that excludes commit A) that the remote git must already have them. It then outputs the ID of commit C and its tree, with the same "unless" condition, and so on. Note that commit A has a source tree associated with it; and suppose commit C has the same tree—for instance, suppose commit C is a revert of B. In this case there's no need to send C's tree: the remote must have it because the remote has commit A.

(This object-finding can be optimized via bitmaps. There's a github blog post, I think, describing the development of these bitmaps, which were a solution to the rather slow process of traversing lots of commit graphs so as to find which objects must already be in some remote repository based on some branch tip IDs. This helps them enormously because the fetch process across a smart protocol is symmetric with that of push: we simply swap send and receive roles.)

In any case, the output from your git rev-list feeds your git pack-objects --thin. This provides all the object IDs to take (commit D, its tree if needed, and any needed subtrees and blobs; commit C and needed objects; commit B and needed objects), and also IDs specifically not to take: commit A and its objects, and if there were commits before A, those and their objects. The pack-objects step makes a delta-compressed pack in which the "take these objects" objects are compressed against the "don't take these other objects" objects.

As a super-simplified example, suppose that the tree for A includes a 10 MB file whose last line is "The end". Suppose that the tree for B has a file that's almost the same, except the words "The end" are removed. Git can compress this file into the instructions "start with blob <id-of-file>, then remove the last line." These instructions are much less than 10 MB long and are allowed in the "thin pack".
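Here is a toy version of that idea in Python. It is not git's actual delta format or algorithm; it only handles the case where the new file shares a prefix with the old one, and the function names make_delta and apply_delta are invented for this sketch. It does use the same two kinds of instruction that git deltas are built from: copy a range from the base, or insert literal bytes.

```python
def make_delta(base: bytes, target: bytes):
    """Encode target as ('copy', offset, length) and ('insert', data) steps.

    Only the shared-prefix case is handled here.
    """
    prefix = 0
    limit = min(len(base), len(target))
    while prefix < limit and base[prefix] == target[prefix]:
        prefix += 1
    delta = []
    if prefix:
        delta.append(("copy", 0, prefix))
    if target[prefix:]:
        delta.append(("insert", target[prefix:]))
    return delta

def apply_delta(base: bytes, delta):
    """Rebuild the target from the base object plus the delta instructions."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)

base = b"line\n" * 1000 + b"The end\n"   # stand-in for the big file in A
target = b"line\n" * 1000                 # the file in B, last line removed
delta = make_delta(base, target)
assert apply_delta(base, delta) == target
print(delta)  # [('copy', 0, 5000)]
```

The delta is a single small instruction, but it is useless without the base blob, which is why the receiver must already have that object.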

It's this "thin pack" that is sent over the Internet-phone connection (or whatever datawire connects the two git instances). The receiver then "thickens" the pack into normal git packs (normal packs do not allow delta-compression against an object that is not already in the pack).


OK, that's quite long, but it boils down to: your git won't send F (because you didn't ask it to), nor E (because you're not sending F), nor will it look at the two trees attached to those two commits. But this does depend on the exact command you use, and whether you're using a smart protocol.

If you run git clone without --single-branch, your clone operation starts by calling up the remote as usual, and getting a list of all that remote's references (just like push!). To see these, use git ls-remote:

From git://git.kernel.org/pub/scm/git/git.git
aa826b651ae3012d1039453b36ed6f1eab939ef9    HEAD
fdca2bed90a7991f2a3afc6a463e45acb03487ac    refs/heads/maint
aa826b651ae3012d1039453b36ed6f1eab939ef9    refs/heads/master
595b96af80404335de2a8c292cee81ed3da24d29    refs/heads/next
60feb01a0d7c7d54849c233d2824880c57ff9e94    refs/heads/pu
7af04ad560ab8edb07b498d442780a6a794162b0    refs/heads/todo
d5aef6e4d58cfe1549adef5b436f3ace984e8c86    refs/tags/gitgui-0.10.0
3d654be48f65545c4d3e35f5d3bbed5489820930    refs/tags/gitgui-0.10.0^{}

[hundreds more snipped]

Your git then requests just about everything from the remote. (In this case the "just about" is unnecessary, but if they present you with refs/ other than heads/ and tags/ you might not get those. You also get some control over what tags your git brings over. The details here are a bit messy, but in most normal repositories, a clone will bring over all the tags.)

You're tripping over a faulty assumption when you say this:

I know I typically get one branch (seems to be usually master), but do I also have local copies of commits for other branches, even if I don't get the pointers to their heads?

Your git asks for, and gets, all their branches. But your git renames them too. They're all renamed to live within the refs/remotes/ name-space, under the name of the remote (normally origin, but -o <name> or --origin <name> changes this). Their refs/heads/master becomes your refs/remotes/origin/master; their refs/heads/maint becomes your refs/remotes/origin/maint; and so on.
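The renaming can be modelled as a simple prefix substitution. The Python sketch below applies the default fetch refspec that git clone writes for a remote named origin, +refs/heads/*:refs/remotes/origin/*. The function map_ref is hypothetical and only handles patterns with a trailing *, which is enough for the default case.

```python
def map_ref(remote_ref, refspec="+refs/heads/*:refs/remotes/origin/*"):
    """Return the local name for remote_ref, or None if the refspec does not match."""
    src, dst = refspec.lstrip("+").split(":", 1)
    if not (src.endswith("*") and dst.endswith("*")):
        raise ValueError("this sketch only handles trailing-* patterns")
    src_prefix, dst_prefix = src[:-1], dst[:-1]
    if not remote_ref.startswith(src_prefix):
        return None
    return dst_prefix + remote_ref[len(src_prefix):]

for ref in ["refs/heads/master", "refs/heads/maint", "refs/tags/gitgui-0.10.0"]:
    print(ref, "->", map_ref(ref))
```

The two branch names come out under refs/remotes/origin/, while the tag does not match this refspec at all (tags are handled separately).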

You will see all of these (abbreviated somewhat) by running git branch -r, which tells git branch to show remote-tracking branches. (And again, "remote-tracking branches" are just those branches whose full name starts with refs/remotes/. A git fetch from a particular remote updates the corresponding remote-tracking branches via the fetch = directives in the repo's configuration entry for that remote.)

The master that you see if you run git branch or git status is actually created as a last step in your clone. It doesn't actually run git checkout—it has the same code built in directly—but in essence, your clone, as its final operation, runs git checkout branch-or-sha1 for some branch name (or, as a last ditch attempt, a raw SHA-1 giving a "detached HEAD"). The name used is:

  • the one you supplied as an argument to git clone, or
  • the branch that the remote git's HEAD points to, if your git can figure this out, or if it was provided during protocol negotiation.¹

If those fail—and assuming you didn't instruct the clone process not to do a checkout—git clone does a checkout of the raw SHA-1 it got from the remote as the remote's HEAD. (In the example ls-remote output above this is aa826b651ae3012d1039453b36ed6f1eab939ef9.)
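The selection order can be sketched as a small Python function. Everything here is illustrative: choose_checkout and its parameters are invented names, and the rule "guess only when exactly one branch matches HEAD's SHA-1" is an assumption of this sketch, not a description of what git does when several branches match.

```python
def choose_checkout(remote_refs, requested_branch=None, head_symref=None):
    """Return ("branch", name) or ("detached", sha1) for the final checkout step."""
    prefix = "refs/heads/"
    if requested_branch is not None:
        # The user named a branch on the git clone command line.
        return ("branch", requested_branch)
    if head_symref is not None:
        # The remote said which branch HEAD points to during negotiation.
        return ("branch", head_symref[len(prefix):])
    head_sha = remote_refs["HEAD"]
    matches = [name for name, sha in remote_refs.items()
               if name.startswith(prefix) and sha == head_sha]
    if len(matches) == 1:
        # Unambiguous guess from the raw SHA-1.
        return ("branch", matches[0][len(prefix):])
    # Last ditch: check out the raw SHA-1 as a detached HEAD.
    return ("detached", head_sha)

refs = {
    "HEAD": "aa826b651ae3012d1039453b36ed6f1eab939ef9",
    "refs/heads/maint": "fdca2bed90a7991f2a3afc6a463e45acb03487ac",
    "refs/heads/master": "aa826b651ae3012d1039453b36ed6f1eab939ef9",
}
print(choose_checkout(refs))  # ('branch', 'master')
```

With the ls-remote output shown earlier, only refs/heads/master matches HEAD's SHA-1, so the guess lands on master.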


¹ Note that HEAD comes across as a raw SHA-1. For a long time, there was a bug in git where, if this SHA-1 corresponded to at least two branch names, git clone didn't know which branch to check out. Because smart protocols start by negotiating options, though, the git folks were able to add an option by which one git tells another "HEAD points to branch X". So now, even if the imported HEAD matches multiple imported refs/heads/* names, git can tell which one to use.

Upvotes: 3
